CMU 11492/11692 Spring 2023: Spoken Language Understanding

In this demonstration, we will show you how to perform spoken language understanding in ESPnet.

Main references: - ESPnet repository - ESPnet documentation

Author: - Siddhant Arora (siddhana@andrew.cmu.edu)

Objectives

After this demonstration, you are expected to understand some of the latest advancements in spoken language understanding.

❗Important Notes❗

  • We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use GPU for some time.

  • There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.

  • Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using File -> Print in the menu bar.

ESPnet installation

We follow the same ESPnet installation procedure as in the previous tutorials (it takes around 15 minutes).

[ ]:
# Install dependencies: transformers, ESPnet (from source), the ESPnet model zoo, and a pinned fairseq commit.
!python -m pip install transformers
!git clone https://github.com/espnet/espnet /espnet
!pip install /espnet
%pip install -q espnet_model_zoo
%pip install fairseq@git+https://github.com/pytorch/fairseq.git@f2146bdc7abf293186de9449bfa2272775e39e1d#egg=fairseq

Spoken Language Understanding

Spoken Language Understanding (SLU) refers to the task of extracting semantic meaning or linguistic structure from spoken utterances. Examples include recognizing the intent of a user's command and its associated entities so that an appropriate action can be taken, understanding the emotion behind a particular utterance, and engaging in conversation with a user by modeling the topic of the conversation. SLU is an essential component of many commercial applications such as voice assistants, social bots, and intelligent home devices, which have to map speech signals to executable commands every day.

Conventional SLU systems employ a cascaded approach, where an automatic speech recognition (ASR) system first recognizes the spoken words from the input audio and a natural language understanding (NLU) system then extracts the intent from the predicted text. These cascaded approaches can effectively utilize pretrained ASR and NLU systems, but they suffer from error propagation: errors in the ASR transcripts can adversely affect downstream SLU performance. Consequently, in this demo we focus on end-to-end (E2E) SLU systems, which predict the intent directly from speech. E2E SLU systems avoid this cascading of errors, but they cannot directly utilize the strong acoustic and semantic representations of pretrained ASR systems and language models.
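To make the contrast concrete, here is a minimal sketch of the two paradigms. The callables asr_model, nlu_model, and slu_model are placeholders (for example, an ESPnet Speech2Text instance and a text classifier), not specific ESPnet APIs.

[ ]:
# Hypothetical sketch: cascaded vs. end-to-end SLU. The model arguments are
# placeholders; the point is where ASR errors can propagate.
def cascaded_slu(audio, asr_model, nlu_model):
    transcript = asr_model(audio)  # step 1: speech -> text
    return nlu_model(transcript)   # step 2: text -> intent (ASR errors propagate here)

def e2e_slu(audio, slu_model):
    return slu_model(audio)        # a single model maps speech directly to intent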

In this tutorial, we will show you some of the latest E2E SLU model architectures in ESPnet-SLU, including single-pass E2E SLU for intent classification, two-pass E2E SLU, E2E SLU for slot filling, and E2E SLU for sentiment analysis.

Overview of the ESPnet-SLU

As ASR systems get better, there is increasing interest in using their output directly for downstream Natural Language Processing (NLP) tasks. With the growing number of SLU datasets and methodologies being proposed, ESPnet-SLU was developed as an open-source SLU toolkit built on the existing open-source speech processing toolkit ESPnet. ESPnet-SLU standardizes the pipelines involved in building an SLU model, such as data preparation, model training, and evaluation, and it helps users build systems for real-world scenarios where many speech processing steps need to be applied before running the downstream task. ESPnet also provides easy access to other speech technologies, such as data augmentation, encoder sub-sampling, and speech-focused encoders like conformers, and it supports many pretrained ASR and NLU systems that can be used as feature extractors in an SLU framework.
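As a side note, pretrained models released with ESPnet can also be fetched through the espnet_model_zoo package installed above, instead of cloning each model repository manually. Below is a minimal sketch of that pattern; model_tag is a placeholder, not an actual model name.

[ ]:
# Sketch: loading a pretrained ESPnet model via espnet_model_zoo (model_tag is a placeholder).
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
model_tag = "<huggingface-or-zenodo-model-tag>"
# download_and_unpack returns the config and checkpoint paths expected by Speech2Text
speech2text = Speech2Text(**d.download_and_unpack(model_tag), nbest=1)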

We show a sample architecture of our E2E SLU model in the figure below:

[Figure: sample E2E SLU model architecture]

1. E2E SLU

1.1 Download Sample Audio File

[ ]:
# Download the sample audio file and listen to it.
!gdown --id 18ANT62ittt7Ai2E8bQRlvT0ZVXXsf1eE -O /content/audio_file.wav
import soundfile
from IPython.display import display, Audio

mixwav_mc, sr = soundfile.read("/content/audio_file.wav")
display(Audio(mixwav_mc.T, rate=sr))

Question1 (✅ Checkpoint 1 (1 point))

Run inference on the given audio using E2E SLU for intent classification.

1.2 Download and Load pretrained E2E SLU Model

[ ]:
!git lfs clone https://huggingface.co/espnet/siddhana_slurp_new_asr_train_asr_conformer_raw_en_word_valid.acc.ave_10best /content/slurp_first_pass_model
from espnet2.bin.asr_inference import Speech2Text
speech2text_slurp = Speech2Text.from_pretrained(
    asr_train_config="/content/slurp_first_pass_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml",
    asr_model_file="/content/slurp_first_pass_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth",
    nbest=1,
)
[ ]:
# Run first-pass inference: the model output is an intent token of the form
# "<scenario>_<action>" followed by the sub-word tokens of the transcript.
nbests_orig = speech2text_slurp(mixwav_mc)
text, *_ = nbests_orig[0]

def text_normalizer(sub_word_transcript):
    # Join sub-word tokens back into words ("▁" marks the start of a new word).
    transcript = sub_word_transcript[0].replace("▁", "")
    for sub_word in sub_word_transcript[1:]:
        if "▁" in sub_word:
            transcript = transcript + " " + sub_word.replace("▁", "")
        else:
            transcript = transcript + sub_word
    return transcript

intent_label = text.split()[0]
intent_text = "{scenario: " + intent_label.split("_")[0] + ", action: " + "_".join(intent_label.split("_")[1:]) + "}"
print(f"INTENT: {intent_text}")
transcript = text_normalizer(text.split()[1:])
print(f"ASR hypothesis: {transcript}")
print("The E2E SLU model fails to predict the correct action.")

2. Two Pass E2E SLU

However, recent work has shown that E2E SLU systems struggle to generalize to unique phrasings of the same intent, suggesting an opportunity to enhance the semantic modeling of existing SLU systems. A number of approaches have been proposed to learn semantic content directly from audio; these approaches aim to incorporate pretrained language models to improve the semantic processing of SLU architectures. In this demo, we use the two-pass E2E SLU model, in which a second-pass model improves on the initial prediction by combining acoustic information from the entire speech signal with semantic information from the ASR hypothesis using a deliberation network.

[Figure: two-pass E2E SLU model with a deliberation network]
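The sketch below illustrates the deliberation idea at a high level: the second pass attends over both the acoustic encoder output and an embedding of the first-pass ASR hypothesis, then classifies the intent from the fused representation. This is a simplified, hypothetical module for intuition only, not the actual ESPnet implementation; the dimensions, names, and number of intents are illustrative.

[ ]:
# Hypothetical deliberation-style second pass (for intuition only, not ESPnet code).
import torch
import torch.nn as nn

class DeliberationSketch(nn.Module):
    def __init__(self, dim=256, num_intents=60):
        super().__init__()
        # cross-attention fuses the semantic stream with the acoustic stream
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_intents)

    def forward(self, acoustic_feats, hypothesis_embeds):
        # queries: ASR-hypothesis embeddings; keys/values: acoustic encoder output
        fused, _ = self.fuse(hypothesis_embeds, acoustic_feats, acoustic_feats)
        return self.classifier(fused.mean(dim=1))  # pooled intent logits

# Toy usage: batch of 1, 100 acoustic frames, 20 hypothesis tokens.
logits = DeliberationSketch()(torch.randn(1, 100, 256), torch.randn(1, 20, 256))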

Question2 (✅ Checkpoint 2 (1 point))

Run inference on the given audio using two-pass E2E SLU.

[ ]:
!git lfs clone https://huggingface.co/espnet/slurp_slu_2pass /content/slurp_second_pass_model
[ ]:
from espnet2.bin.slu_inference import Speech2Understand
from transformers import AutoModel, AutoTokenizer
speech2text_second_pass_slurp = Speech2Understand.from_pretrained(
    slu_train_config="/content/slurp_second_pass_model/exp/slu_train_asr_bert_conformer_deliberation_raw_en_word/config.yaml",
    slu_model_file="/content/slurp_second_pass_model/exp/slu_train_asr_bert_conformer_deliberation_raw_en_word/valid.acc.ave_10best.pth",
    nbest=1,
)
[ ]:
# Convert the first-pass ASR hypothesis into token IDs so it can be fed to the
# second-pass (deliberation) model together with the audio.
import numpy as np

from espnet2.tasks.slu import SLUTask

preprocess_fn = SLUTask.build_preprocess_fn(
    speech2text_second_pass_slurp.asr_train_args, False
)
transcript = preprocess_fn.text_cleaner(transcript)
tokens = preprocess_fn.transcript_tokenizer.text2tokens(transcript)
text_ints = np.array(
    preprocess_fn.transcript_token_id_converter.tokens2ids(tokens), dtype=np.int64
)
[ ]:
import torch

# Second-pass inference conditions on both the audio and the tokenized first-pass hypothesis.
nbests = speech2text_second_pass_slurp(mixwav_mc, torch.tensor(text_ints))
text1, *_ = nbests[0]

intent_label = text1.split()[0]
intent_text = "{scenario: " + intent_label.split("_")[0] + ", action: " + "_".join(intent_label.split("_")[1:]) + "}"
print(f"INTENT: {intent_text}")
transcript = text_normalizer(text1.split()[1:])
print(f"ASR hypothesis: {transcript}")
print("The second-pass SLU model successfully recognizes the correct action.")

3. E2E SLU for Slot Filling

Question3 (✅ Checkpoint 3 (1 point))

Run inference on the given audio using E2E SLU for slot filling.

[ ]:
# Download the sample audio file for slot filling and listen to it.
!gdown --id 1ezs8IPutLr-C0PXKb6pfOlb6XXFDXcPd -O /content/audio_slurp_entity_file.wav
import soundfile
from IPython.display import display, Audio

mixwav_mc, sr = soundfile.read("/content/audio_slurp_entity_file.wav")
display(Audio(mixwav_mc.T, rate=sr))
[ ]:
!git lfs clone https://huggingface.co/espnet/siddhana_slurp_entity_asr_train_asr_conformer_raw_en_word_valid.acc.ave_10best /content/slurp_entity_model
from espnet2.bin.asr_inference import Speech2Text
speech2text_slurp = Speech2Text.from_pretrained(
    asr_train_config="/content/slurp_entity_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml",
    asr_model_file="/content/slurp_entity_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth",
    nbest=1,
)
[ ]:
nbests_orig = speech2text_slurp(mixwav_mc)
text, *_ = nbests_orig[0]
[ ]:
def entity_text_normalizer(sub_word_transcript_list):
    # Each element is one entity segment of the form "<slot name> FILL <slot value>",
    # made up of sub-word tokens; join the sub-words and build a slot dictionary.
    transcript_dict = {}
    for sub_word_transcript_new in sub_word_transcript_list:
        sub_word_transcript = sub_word_transcript_new.split()
        transcript = sub_word_transcript[0].replace("▁", "")
        for sub_word in sub_word_transcript[1:]:
            if "▁" in sub_word:
                transcript = transcript + " " + sub_word.replace("▁", "")
            else:
                transcript = transcript + sub_word
        parts = transcript.split(" FILL ")
        transcript_dict[parts[0]] = parts[1]
    return transcript_dict

# The model output is "<scenario>_<action>", then "▁SEP"-separated entity segments,
# with the transcript after the last "▁SEP".
intent_label = text.split()[0]
intent_text = "{scenario: " + intent_label.split("_")[0] + ", action: " + "_".join(intent_label.split("_")[1:]) + "}"
print(f"INTENT: {intent_text}")
transcript = text_normalizer(" ".join(text.split()[1:]).split("▁SEP")[-1].split())
print(f"ASR hypothesis: {transcript}")
entity_transcript = entity_text_normalizer(" ".join(text.split()[1:]).split("▁SEP")[1:-1])
print(f"Slot dictionary: {entity_transcript}")

4. E2E SLU for Sentiment Analysis

Question4 (✅ Checkpoint 4 (1 point))

Run inference on the given audio using E2E SLU for sentiment analysis.

[ ]:
# Download the sample audio file for sentiment analysis and listen to it.
!gdown --id 1CZzmpMliwSzja9TdBV7wmidlGepZBEUi -O /content/audio_iemocap_file.wav
import soundfile
from IPython.display import display, Audio

mixwav_mc, sr = soundfile.read("/content/audio_iemocap_file.wav")
display(Audio(mixwav_mc.T, rate=sr))
[ ]:
!git lfs clone https://huggingface.co/espnet/YushiUeda_iemocap_sentiment_asr_train_asr_conformer /content/iemocap_model
from espnet2.bin.asr_inference import Speech2Text
speech2text_iemocap = Speech2Text.from_pretrained(
    asr_train_config="/content/iemocap_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml",
    asr_model_file="/content/iemocap_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth",
    nbest=1,
)
[ ]:
# The first token of the model output is the predicted sentiment label.
nbests_orig = speech2text_iemocap(mixwav_mc)
text, *_ = nbests_orig[0]
sentiment_text = text.split()[0]
print(f"SENTIMENT: {sentiment_text}")

Question5 (✅ Checkpoint 5 (1 point))

Discuss the potential advantages of integrating pretrained LMs inside an E2E SLU framework compared to using them in a cascaded manner.

[ANSWER HERE]