CMU 11492/11692 Spring 2023: Speech Translation

In this demonstration, we will walk through several speech translation systems in ESPnet.

Main references:

  • ESPnet repository

  • ESPnet documentation

  • ESPnet-ST-v2 demo

  • ESPnet-ST repo (WIP)

Author: Jiatong Shi (jiatongs@andrew.cmu.edu)

Objectives

After this demonstration, you are expected to understand some of the latest advancements in speech translation.

❗Important Notes❗

  • We are using Colab to show the demo. However, Colab limits the total GPU runtime. If you use too much GPU time, you may not be able to use a GPU for some time.

  • There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.

  • Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using File -> Print in the menu bar.

ESPnet installation (Inference version)

Unlike the previous assignment, where we installed the full version of ESPnet, here we use a lightweight ESPnet package designed mainly for inference. Installing the light version can be much faster than a full installation. Note that this is active, ongoing work in ESPnet: the codebase is still being merged, so we will use a branch from our development fork for this assignment.

[ ]:
!pip install typeguard==2.13.3
!git clone --depth 5 -b merge_s2st_st https://github.com/ftshijt/espnet.git
!cd espnet && pip install .
!pip install -q espnet_model_zoo

This assignment also needs a few other toolkits and packages.

[ ]:
!pip install --upgrade --no-cache-dir gdown
!git clone --depth 1 https://github.com/kan-bayashi/ParallelWaveGAN.git
!cd ParallelWaveGAN && pip install .
!pip install pysndfile
!pip install sacrebleu
!pip install mosestokenizer
!git clone https://github.com/facebookresearch/SimulEval.git
!cd SimulEval && pip install -e .
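
Because the installation cells above pull in several packages, a quick sanity check can save time before moving on. The following is a minimal sketch that only tests whether each package is importable. The module names in `EXPECTED_MODULES` are assumed import names for the packages installed above, and `check_modules` is a hypothetical helper, not part of ESPnet.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages installed above;
# adjust the list if a package exposes a different module name.
EXPECTED_MODULES = [
    "espnet2",
    "espnet_model_zoo",
    "parallel_wavegan",
    "sacrebleu",
    "simuleval",
    "gdown",
]


def check_modules(names):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: find_spec(name) is not None for name in names}


# Print one status line per module without actually importing it.
for name, ok in check_modules(EXPECTED_MODULES).items():
    print(f"{name:20s} {'OK' if ok else 'MISSING'}")
```

If a module is reported as missing, re-running the corresponding install cell (and restarting the runtime if Colab asks for it) is usually the first thing to try.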

Speech Translation

Speech translation is the task of translating speech in one language into text or speech in another language. In this tutorial, we will show you some of the latest models (in ESPnet-ST-v2) in the field of speech translation and demonstrate how to use them in different scenarios, including:

  • offline speech-to-text translation

  • simultaneous speech-to-text translation

  • speech-to-speech translation

Overview of the ESPnet-ST-v2

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) – each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models.

In general, the toolkit is organized in a Pythonic way to support model training and inference; we also provide recipes for data preparation, model training, and evaluation.

1. Offline Speech-to-text Translation (ST)

1.1 Model download

[ ]:
# Download pretrained st model
!gdown 1Sn2rAZXVSm1hrCj5OIlq61EgbjKXNGdq
!unzip -o st_train_st_ctc_md_conformer_asrinit_v3_noamp_batch50m_ctcsamp0.1_lr1e-3_raw_en_es_bpe_tc4000_sp_valid.acc.ave.zip

1.2 Model Setup

[ ]:
import time
import torch
import string
from espnet2.bin.st_inference import Speech2Text

lang="es"
fs = 16000

speech2text = Speech2Text(
    st_model_file="/content/exp/st_train_st_ctc_md_conformer_asrinit_v3_noamp_batch50m_ctcsamp0.1_lr1e-3_raw_en_es_bpe_tc4000_sp/valid.acc.ave_10best.pth",
    st_train_config="/content/exp/st_train_st_ctc_md_conformer_asrinit_v3_noamp_batch50m_ctcsamp0.1_lr1e-3_raw_en_es_bpe_tc4000_sp/config.yaml",
    beam_size=10,
    ctc_weight=0.3,
    asr_beam_size=10,
    asr_ctc_weight=0.3,
    device="cuda",
)
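
The setup above hard-codes `device="cuda"`, which fails if Colab has not assigned a GPU to the session. The following is a minimal sketch of a fallback. `pick_device` is a hypothetical helper name, and CPU inference is assumed to work, only more slowly.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report "cpu" so the helper stays usable.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed as `device=device` when constructing `Speech2Text`.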

1.3 Translate our example recordings

[ ]:
!git clone https://github.com/ftshijt/ESPnet_st_egs.git
[ ]:
import torch
import pandas as pd
import soundfile as sf
import librosa.display
from IPython.display import display, Audio
import matplotlib.pyplot as plt
from sacrebleu.metrics import BLEU

bleu = BLEU()

egs = pd.read_csv("ESPnet_st_egs/st/egs.csv")
for index, row in egs.iterrows():
  if row["lang"] == lang or lang == "multilingual":
    speech, rate = sf.read("ESPnet_st_egs/" + row["path"])
    assert fs == int(row["sr"])
    text, _, _, _ = speech2text(speech)[0][0]
    display(Audio(speech, rate=fs))
    librosa.display.waveshow(speech, sr=fs)  # waveplot was removed in librosa 0.10
    plt.show()
    print(f"Reference source text: {row['src_text']}")
    print(f"Translation results: {text}")
    print(f"Reference target text: {row['tgt_text']}")
    print(f"Sentence BLEU Score: {bleu.sentence_score(text, [row['tgt_text']])}")
    print("*" * 50)

Task1 (✅ Checkpoint 1 (2 points))

We have printed the sentence-level BLEU score for each utterance. Can you compute the corpus-level BLEU score with sacreBLEU over the five example utterances?
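
Corpus-level BLEU is not the average of the sentence-level scores: the n-gram matches and lengths are pooled over all sentences before the precisions are computed. The following is a simplified pure-Python sketch of that pooling, using toy English sentences rather than the checkpoint data. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its numbers will differ from sacreBLEU's. `simple_corpus_bleu` is a hypothetical helper, not the sacreBLEU API.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hyps, refs, max_n=4):
    """Simplified corpus BLEU: whitespace tokens, one reference, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, pooled over the corpus
    totals = [0] * max_n   # hypothesis n-gram counts, pooled over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


toy_hyps = ["the cat sat on the mat", "hello world"]
toy_refs = ["the cat sat on the mat", "hello world"]

corpus_score = simple_corpus_bleu(toy_hyps, toy_refs)
sentence_scores = [simple_corpus_bleu([h], [r]) for h, r in zip(toy_hyps, toy_refs)]
print("corpus:", corpus_score)
print("per sentence:", sentence_scores)
```

In this toy case the two-word sentence has no 3-grams or 4-grams, so its unsmoothed sentence score collapses to zero even though it matches the reference exactly, while the pooled corpus score is unaffected. This is one reason corpus-level scores are usually the ones reported.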

[ ]:
# CHECKPOINT1

refs = [[
    "Acabo de regresar de una comunidad que tiene el secreto de la supervivencia humana .",
    "En última instancia , avanzar , creo que tenemos que darle lugar al miedo .",
    "Cuando recién ingresaba a la universidad tuve mi primer clase de biología .",
    "Comparte sus experiencias con ellos .",
    "Cada vez que estén de vacaciones y alguien colapse , puede ser un pariente o alguien enfrente de Uds. , pueden encontrarlo .",
]]

# Please fill in the translation results here
hyps = [

]

# Please compute the corpus BLEU score

1.4 Translate your own live-recordings

  1. Record your own voice

  2. Translate your voice with the ST system

[ ]:
# from https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be

from IPython.display import Javascript, display  # display is used in record() below
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec, filename='audio.wav'):
  display(Javascript(RECORD))
  s = output.eval_js('record(%d)' % (sec * 1000))
  b = b64decode(s.split(',')[1])
  with open(filename, 'wb+') as f:
    f.write(b)

audio = 'audio.wav'
second = 5
print(f"Speak to your microphone {second} sec...")
record(second, audio)
print("Done!")


import librosa
import librosa.display
speech, rate = librosa.load(audio, sr=16000)
librosa.display.waveshow(speech, sr=rate)  # waveplot was removed in librosa 0.10

import matplotlib.pyplot as plt
plt.show()

import pysndfile
pysndfile.sndio.write('audio_ds.wav', speech, rate=rate, format='wav', enc='pcm16')

from IPython.display import display, Audio
display(Audio(speech, rate=rate))
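
The cell above resamples the recording to 16 kHz and writes it out as 16-bit PCM. The following is a minimal sketch for confirming that a WAV file really has that format before feeding it to the model, using only the standard-library `wave` module. `describe_wav` and `is_model_ready` are hypothetical helper names. The demo writes a second of silence so that the block runs on its own; in the notebook you would point the helpers at `audio_ds.wav` instead.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate, channels, bits_per_sample, seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        seconds = wav.getnframes() / rate
    return rate, channels, bits, seconds


def is_model_ready(path, expected_rate=16000):
    """True if the file is mono, 16-bit PCM at the expected sample rate."""
    rate, channels, bits, _ = describe_wav(path)
    return rate == expected_rate and channels == 1 and bits == 16


# Self-contained demo: write one second of 16 kHz mono silence and inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(describe_wav(demo_path))
print(is_model_ready(demo_path))
```

Note that the raw browser recording saved as `audio.wav` may not be a true WAV container, so this check is meant for the re-encoded `audio_ds.wav`.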

Task2 (✅ Checkpoint 2 (1 point))

Please follow the same procedure as previous examples and print out the translation results. (You can directly use the speech2text function)

[ ]:
# [CHECKPOINT2]
# Follow the same procedure as previous examples, print out the translation results

2. Simultaneous Speech-to-text Translation (SST)

[ ]:
# Download pretrained SST model
!gdown 1ekUeMvmaB3ZhAIY_KAb_we1zhIRFZhtu
!unzip -o /content/st_train_st_ctc_conformer_asrinit_v2_streaming_40block_nohier_18lyr_raw_en_es_bpe_tc4000_sp_valid.acc.ave.zip
[ ]:
import time
import torch
import string
from espnet2.bin.st_inference_streaming import Speech2TextStreaming

lang="es"
fs = 16000

speech2textstreaming = Speech2TextStreaming(
    st_model_file="/content/exp/st_train_st_ctc_conformer_asrinit_v2_streaming_40block_nohier_18lyr_raw_en_es_bpe_tc4000_sp/valid.acc.ave_10best.pth",
    st_train_config="/content/exp/st_train_st_ctc_conformer_asrinit_v2_streaming_40block_nohier_18lyr_raw_en_es_bpe_tc4000_sp/config.yaml",
    penalty=0.4,
    blank_penalty=0.5,
    beam_size=10,
    ctc_weight=0.5,
    incremental_decode=True,
    time_sync=True,
    device="cuda",
)
[ ]:
import torch
import pandas as pd
import soundfile as sf
import librosa.display
from IPython.display import display, Audio
import matplotlib.pyplot as plt
from sacrebleu.metrics import BLEU

bleu = BLEU()

egs = pd.read_csv("ESPnet_st_egs/st/egs.csv")
for index, row in egs.iterrows():
  if row["lang"] == lang or lang == "multilingual":
    speech, rate = sf.read("ESPnet_st_egs/" + row["path"])
    assert fs == int(row["sr"])
    text = speech2textstreaming(speech)[0][0]
    display(Audio(speech, rate=fs))
    librosa.display.waveshow(speech, sr=fs)  # waveplot was removed in librosa 0.10
    plt.show()
    print(f"Reference source text: {row['src_text']}")
    print(f"Translation results: {text}")
    print(f"Reference target text: {row['tgt_text']}")
    print(f"Sentence BLEU Score: {bleu.sentence_score(text, [row['tgt_text']])}")
    print("*" * 50)

Question3 (✅ Checkpoint 3 (1 point))

How does the streaming model perform compared to the offline model? Can you explain the difference in performance?

(For question-based checkpoint: please directly answer it in the text box)

[YOUR ANSWER HERE]

[ ]:
!simuleval \
    --source /content/ESPnet_st_egs/st/wav.scp \
    --target /content/ESPnet_st_egs/st/ref.detok.trn \
    --agent /content/espnet/egs2/TEMPLATE/st1/pyscripts/utils/simuleval_agent.py \
    --batch_size 1 \
    --ngpu 0 \
    --st_train_config /content/exp/st_train_st_ctc_conformer_asrinit_v2_streaming_40block_nohier_18lyr_raw_en_es_bpe_tc4000_sp/config.yaml \
    --st_model_file exp/st_train_st_ctc_conformer_asrinit_v2_streaming_40block_nohier_18lyr_raw_en_es_bpe_tc4000_sp/valid.acc.ave_10best.pth \
    --disable_repetition_detection false \
    --beam_size 10 \
    --sim_chunk_length 2048 \
    --backend streaming \
    --ctc_weight 0.5 \
    --incremental_decode true \
    --penalty 0.4 \
    --blank_penalty 0.7 \
    --time_sync true \
    --latency-metrics LAAL AL AP DAL
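
To get a feel for the chunk size used in the command above, the following is a small arithmetic sketch. It assumes that `--sim_chunk_length` is measured in audio samples at the 16 kHz rate used throughout this notebook, and that a trailing partial chunk still counts as one chunk. Both are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def chunk_duration_ms(chunk_samples, sample_rate):
    """Duration of one chunk in milliseconds."""
    return 1000.0 * chunk_samples / sample_rate


def num_chunks(total_samples, chunk_samples):
    """Number of chunks needed to cover the audio, counting a partial last chunk."""
    return math.ceil(total_samples / chunk_samples)


fs = 16000               # sample rate used in this notebook
sim_chunk_length = 2048  # value passed to --sim_chunk_length above

print(chunk_duration_ms(sim_chunk_length, fs))  # milliseconds of audio per chunk
print(num_chunks(5 * fs, sim_chunk_length))     # chunks in a 5-second recording
```

Under these assumptions, each chunk covers 128 ms of audio, so the system receives new input roughly eight times per second.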

Question4 (✅ Checkpoint 4 (1 point))

Besides BLEU, we have LAAL, AL, AP, and DAL for evaluation. AL (average lagging) is one of the most widely used metrics in recent work. Please describe what AL is in one sentence.

(For question-based checkpoint: please directly answer it in the text box)

[YOUR ANSWER HERE]