{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "k8GMfsH2mRND" }, "source": [ "# CMU 11492/11692 Spring 2023: Spoken Language Understanding\n", "\n", "In this demonstration, we will show you how to perform spoken language understanding in ESPnet.\n", "\n", "Main references:\n", "- [ESPnet repository](https://github.com/espnet/espnet)\n", "- [ESPnet documentation](https://espnet.github.io/espnet/)\n", "\n", "Author:\n", "- Siddhant Arora (siddhana@andrew.cmu.edu)\n", "\n", "## Objectives\n", "\n", "After this demonstration, you are expected to understand some of the latest advancements in spoken language understanding.\n", "\n", "## ❗Important Notes❗\n", "- We are using Colab to show the demo. However, Colab limits the total GPU runtime. If you use too much GPU time, you may be unable to use a GPU for some time.\n", "- There are multiple in-class checkpoints ✅ throughout this tutorial. **Your participation points are based on these tasks.** Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.\n", "- Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using `File -> Print` in the menu bar." ] }, { "cell_type": "markdown", "metadata": { "id": "1Ftl74aZnBrH" }, "source": [ "## ESPnet installation\n", "\n", "We follow the same ESPnet installation procedure as in the previous tutorials (it takes around 15 minutes)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3p6lJ3YNmArH" }, "outputs": [], "source": [ "! 
python -m pip install transformers\n", "!git clone https://github.com/espnet/espnet /espnet\n", "!pip install /espnet\n", "%pip install -q espnet_model_zoo\n", "%pip install fairseq@git+https://github.com/pytorch/fairseq.git@f2146bdc7abf293186de9449bfa2272775e39e1d#egg=fairseq" ] }, { "cell_type": "markdown", "metadata": { "id": "C0G94uP6qVTz" }, "source": [ "## Spoken Language Understanding\n", "Spoken Language Understanding (SLU) refers to the task of extracting semantic meaning or linguistic structure from spoken utterances. Examples include recognizing the intent of a user’s command and its associated entities in order to take appropriate action, understanding the emotion behind a particular utterance, and engaging in conversations\n", "with a user by modeling the topic of a conversation. SLU is an essential component of many commercial applications like voice assistants, social bots, and intelligent home devices, which have to map speech signals to executable commands every\n", "day.\n", "\n", "Conventional SLU systems employ a cascaded approach for sequence labeling, where an automatic speech recognition (ASR) system first recognizes the spoken words from the input audio and a natural language understanding (NLU) system then extracts the intent from the predicted text. These cascaded approaches can effectively utilize pretrained ASR and NLU systems. However, they suffer from error propagation, as errors in the ASR transcripts can adversely affect downstream SLU performance. Consequently, in this demo, we focus on end-to-end (E2E) SLU systems, which aim to predict the intent directly from speech. E2E SLU systems avoid the cascading of errors but cannot directly utilize strong acoustic and semantic representations from pretrained ASR systems and language models. 
\n", "\n", "In this tutorial, we will show you some of the latest E2E SLU model architectures available in ESPnet-SLU, including\n", "\n", "- E2E SLU (https://arxiv.org/abs/2111.14706)\n", "- Two Pass E2E SLU (https://arxiv.org/abs/2207.06670)" ] }, { "cell_type": "markdown", "source": [ "## Overview of ESPnet-SLU\n", "\n", "As ASR systems are getting better, there is an increasing interest in using the ASR output directly to do downstream Natural Language Processing (NLP) tasks. Given the growing number of SLU datasets and proposed methodologies, ESPnet-SLU is an open-source SLU toolkit built on top of ESPnet, an existing open-source speech processing toolkit. ESPnet-SLU standardizes the pipelines involved in building an SLU model, such as data preparation, model training, and evaluation. ESPnet-SLU helps users build systems for real-world scenarios where many speech processing steps need to be applied before running the downstream task. ESPnet also provides\n", "easy access to other speech technologies under development, such as data augmentation, encoder sub-sampling, and speech-focused\n", "encoders like conformers. It also supports many pretrained\n", "ASR and NLU systems that can be used as feature\n", "extractors in an SLU framework.\n", "\n", "A sample architecture of our E2E SLU model is shown in the figure below:\n", "\n", "![picture](https://drive.google.com/uc?id=1qzWcOV3x5-cj9OHB-iVtCGfY1tQCWk76)\n" ], "metadata": { "id": "cEx4BlXrv7fz" } }, { "cell_type": "markdown", "source": [ "## 1. 
E2E SLU\n" ], "metadata": { "id": "wXaFrm070-90" } }, { "cell_type": "markdown", "source": [ "### 1.1 Download Sample Audio File" ], "metadata": { "id": "mH4LphS7_g8P" } }, { "cell_type": "code", "source": [ "!gdown --id 18ANT62ittt7Ai2E8bQRlvT0ZVXXsf1eE -O /content/audio_file.wav\n", "import os\n", "\n", "import soundfile\n", "from IPython.display import display, Audio\n", "mixwav_mc, sr = soundfile.read(\"/content/audio_file.wav\")\n", "display(Audio(mixwav_mc.T, rate=sr))" ], "metadata": { "id": "ek_XkPaD1qTF" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Question1 (✅ Checkpoint 1 (1 point))\n", "Run inference on the given audio using E2E SLU for intent classification." ], "metadata": { "id": "5YGsTrCkBchx" } }, { "cell_type": "markdown", "source": [ "### 1.2 Download and Load pretrained E2E SLU Model" ], "metadata": { "id": "3abHn-6m_7p-" } }, { "cell_type": "code", "source": [ "!git lfs clone https://huggingface.co/espnet/siddhana_slurp_new_asr_train_asr_conformer_raw_en_word_valid.acc.ave_10best /content/slurp_first_pass_model\n", "from espnet2.bin.asr_inference import Speech2Text\n", "speech2text_slurp = Speech2Text.from_pretrained(\n", " asr_train_config=\"/content/slurp_first_pass_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml\",\n", " asr_model_file=\"/content/slurp_first_pass_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth\",\n", " nbest=1,\n", ")" ], "metadata": { "id": "6h58qX3rAGlV" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "nbests_orig = speech2text_slurp(mixwav_mc)\n", "text, *_ = nbests_orig[0]\n", "def text_normalizer(sub_word_transcript):\n", "    # Join sub-word tokens into words; a token containing \"▁\" starts a new word.\n", "    transcript = sub_word_transcript[0].replace(\"▁\", \"\")\n", "    for sub_word in sub_word_transcript[1:]:\n", "        if \"▁\" in sub_word:\n", "            transcript = transcript + \" \" + sub_word.replace(\"▁\", \"\")\n", "        else:\n", "            transcript = transcript + sub_word\n", "    return transcript\n", 
"intent_text=\"{scenario: \"+text.split()[0].split(\"_\")[0]+\", action: \"+\"_\".join(text.split()[0].split(\"_\")[1:])+\"}\"\n", "print(f\"INTENT: {intent_text}\")\n", "transcript=text_normalizer(text.split()[1:])\n", "print(f\"ASR hypothesis: {transcript}\")\n", "print(\"E2E SLU model fails to predict the correct action.\")" ], "metadata": { "id": "Gx8hkEp5AQFj" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 2. Two Pass E2E SLU\n", "\n", "Recent work has shown that E2E SLU systems\n", "struggle to generalize to unique phrasing for the same intent,\n", "suggesting an opportunity for enhancing semantic modeling of\n", "existing SLU systems. A number of approaches have\n", "been proposed to learn semantic content directly from audio.\n", "These approaches aim to incorporate pretrained language models to improve semantic processing of SLU architectures.\n", "In this demo, we use the Two Pass E2E SLU model, where\n", "the second pass model improves on the initial prediction by combining acoustic information from the entire speech input and semantic information from the ASR hypothesis using a deliberation network.\n", "\n", "![picture](https://drive.google.com/uc?id=1imEA98mIqcC6i-Cgdc84msHKliaVgtdf)" ], "metadata": { "id": "cP12SPRzAcGn" } }, { "cell_type": "markdown", "source": [ "### Question2 (✅ Checkpoint 2 (1 point))\n", "Run inference on the given audio using the two-pass SLU model." ], "metadata": { "id": "mgqhhQRtBj0h" } }, { "cell_type": "code", "source": [ "!git lfs clone https://huggingface.co/espnet/slurp_slu_2pass /content/slurp_second_pass_model" ], "metadata": { "id": "xjYZskNkncPg" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from espnet2.bin.slu_inference import Speech2Understand\n", "from transformers import AutoModel, AutoTokenizer\n", "speech2text_second_pass_slurp = Speech2Understand.from_pretrained(\n", " 
slu_train_config=\"/content/slurp_second_pass_model/exp/slu_train_asr_bert_conformer_deliberation_raw_en_word/config.yaml\",\n", " slu_model_file=\"/content/slurp_second_pass_model/exp/slu_train_asr_bert_conformer_deliberation_raw_en_word/valid.acc.ave_10best.pth\",\n", " nbest=1,\n", ")" ], "metadata": { "id": "bTNDYcf5npxi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from espnet2.tasks.slu import SLUTask\n", "preprocess_fn=SLUTask.build_preprocess_fn(\n", " speech2text_second_pass_slurp.asr_train_args, False\n", " )\n", "import numpy as np\n", "transcript = preprocess_fn.text_cleaner(transcript)\n", "tokens = preprocess_fn.transcript_tokenizer.text2tokens(transcript)\n", "text_ints = np.array(preprocess_fn.transcript_token_id_converter.tokens2ids(tokens), dtype=np.int64)" ], "metadata": { "id": "X5Ha-k5Snumh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import torch\n", "nbests = speech2text_second_pass_slurp(mixwav_mc,torch.tensor(text_ints))\n", "text1, *_ = nbests[0]\n", "intent_text=\"{scenario: \"+text1.split()[0].split(\"_\")[0]+\", action: \"+\"_\".join(text1.split()[0].split(\"_\")[1:])+\"}\"\n", "print(f\"INTENT: {intent_text}\")\n", "transcript=text_normalizer(text1.split()[1:])\n", "print(f\"ASR hypothesis: {transcript}\")\n", "print(f\"Second pass SLU model successfully recognizes the correct action.\")" ], "metadata": { "id": "T5qVgHEgny7T" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 3. 
E2E SLU for Slot Filling" ], "metadata": { "id": "CTQ50tysGXFE" } }, { "cell_type": "markdown", "source": [ "### Question3 (✅ Checkpoint 3 (1 point))\n", "Run inference on the given audio using E2E SLU for slot filling." ], "metadata": { "id": "ffm5eP2iGZqA" } }, { "cell_type": "code", "source": [ "!gdown --id 1ezs8IPutLr-C0PXKb6pfOlb6XXFDXcPd -O /content/audio_slurp_entity_file.wav\n", "import os\n", "\n", "import soundfile\n", "from IPython.display import display, Audio\n", "mixwav_mc, sr = soundfile.read(\"/content/audio_slurp_entity_file.wav\")\n", "display(Audio(mixwav_mc.T, rate=sr))" ], "metadata": { "id": "cwyo47gSGhUS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "!git lfs clone https://huggingface.co/espnet/siddhana_slurp_entity_asr_train_asr_conformer_raw_en_word_valid.acc.ave_10best /content/slurp_entity_model\n", "from espnet2.bin.asr_inference import Speech2Text\n", "speech2text_slurp = Speech2Text.from_pretrained(\n", " asr_train_config=\"/content/slurp_entity_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml\",\n", " asr_model_file=\"/content/slurp_entity_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth\",\n", " nbest=1,\n", ")" ], "metadata": { "id": "HktH4uUnG-t5" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "nbests_orig = speech2text_slurp(mixwav_mc)\n", "text, *_ = nbests_orig[0]" ], "metadata": { "id": "lnBnr7h9HRk7" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def entity_text_normalizer(sub_word_transcript_list):\n", "    transcript_dict={}\n", "    for sub_word_transcript_new in sub_word_transcript_list:\n", "        sub_word_transcript=sub_word_transcript_new.split()\n", "        # Join sub-word tokens into words; a token containing \"▁\" starts a new word.\n", "        transcript = sub_word_transcript[0].replace(\"▁\", \"\")\n", "        for sub_word in sub_word_transcript[1:]:\n", "            if \"▁\" in sub_word:\n", "                transcript = transcript + \" \" + sub_word.replace(\"▁\", \"\")\n", "            else:\n", "                transcript = transcript + sub_word\n", "        # Split each segment at \" FILL \" into a dictionary key and its value.\n", "        transcript_dict[transcript.split(\" FILL \")[0]]=transcript.split(\" FILL \")[1]\n", "    return transcript_dict\n", "intent_text=\"{scenario: \"+text.split()[0].split(\"_\")[0]+\", action: \"+\"_\".join(text.split()[0].split(\"_\")[1:])+\"}\"\n", "print(f\"INTENT: {intent_text}\")\n", "transcript=text_normalizer(\" \".join(text.split()[1:]).split(\"▁SEP\")[-1].split())\n", "print(f\"ASR hypothesis: {transcript}\")\n", "entity_transcript=entity_text_normalizer(\" \".join(text.split()[1:]).split(\"▁SEP\")[1:-1])\n", "print(f\"Slot dictionary: {entity_transcript}\")" ], "metadata": { "id": "jWRTX_KFHXP7" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## 4. E2E SLU for Sentiment Analysis" ], "metadata": { "id": "zsgZktluHtdT" } }, { "cell_type": "markdown", "source": [ "### Question4 (✅ Checkpoint 4 (1 point))\n", "Run inference on the given audio using E2E SLU for sentiment analysis." ], "metadata": { "id": "aHYhpAuzH6kW" } }, { "cell_type": "code", "source": [ "!gdown --id 1CZzmpMliwSzja9TdBV7wmidlGepZBEUi -O /content/audio_iemocap_file.wav\n", "import os\n", "\n", "import soundfile\n", "from IPython.display import display, Audio\n", "mixwav_mc, sr = soundfile.read(\"/content/audio_iemocap_file.wav\")\n", "display(Audio(mixwav_mc.T, rate=sr))" ], "metadata": { "id": "xSbyBSAVK859" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "!git lfs clone https://huggingface.co/espnet/YushiUeda_iemocap_sentiment_asr_train_asr_conformer /content/iemocap_model\n", "from espnet2.bin.asr_inference import Speech2Text\n", "speech2text_iemocap = Speech2Text.from_pretrained(\n", " asr_train_config=\"/content/iemocap_model/exp/asr_train_asr_conformer_raw_en_word/config.yaml\",\n", " 
asr_model_file=\"/content/iemocap_model/exp/asr_train_asr_conformer_raw_en_word/valid.acc.ave_10best.pth\",\n", " nbest=1,\n", ")" ], "metadata": { "id": "qw2fGKTmMfRd" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "nbests_orig = speech2text_iemocap(mixwav_mc)\n", "text, *_ = nbests_orig[0]\n", "sentiment_text=text.split()[0]\n", "print(f\"SENTIMENT: {sentiment_text}\")" ], "metadata": { "id": "9uDxGyBuObp7" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Question5 (✅ Checkpoint 5 (1 point))\n", "Discuss the potential advantages of integrating pretrained LMs inside an E2E SLU framework compared to using them in a cascaded manner.\n", "\n", "\n", "[ANSWER HERE]\n" ], "metadata": { "id": "cHpF5W3ZB3tH" } } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "accelerator": "GPU", "gpuClass": "standard" }, "nbformat": 4, "nbformat_minor": 0 }