ESPnet Speech Translation Demonstration
See also
- ESPnet: https://github.com/espnet/espnet
- ESPnet documentation: https://espnet.github.io/espnet/
- TTS demo: https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb
Author: Shigeki Karita
Install
Installation takes around 3 minutes. Please wait for it to finish.
# OS setup
!cat /etc/os-release
!apt-get install -qq bc tree sox
# espnet and moses setup
!git clone -q https://github.com/ShigekiKarita/espnet.git
!pip install -q torch==1.1
!cd espnet; git checkout c0466d9a356c1a33f671a546426d7bc33b5b17e8; pip install -q -e .
!cd espnet/tools/; make moses.done
# download pre-compiled warp-ctc and kaldi tools
!espnet/utils/download_from_google_drive.sh \
"https://drive.google.com/open?id=13Y4tSygc8WtqzvAVGK_vRV9GlV7TRC0w" espnet/tools tar.gz > /dev/null
# make dummy activate
!mkdir -p espnet/tools/venv/bin && touch espnet/tools/venv/bin/activate
!echo "setup done."
Spanish speech -> English text translation
This audio says "yo soy José" ("I am José").
from IPython.display import display, Audio
display(Audio("/content/espnet/test_utils/st_test.wav", rate=16000))
Let's translate this audio into English text with a pretrained Transformer ST model trained on the Fisher-CALLHOME Spanish corpus.
# move to the recipe directory
import os
os.chdir("/content/espnet/egs/fisher_callhome_spanish/st1")
!../../../utils/translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en ../../../test_utils/st_test.wav | tee /content/translated.txt
As shown above, we successfully obtained the result "Translated text: yes i'm jose"!
English translated text-to-speech synthesis
Now let's generate English speech from the translated text using a pretrained ESPnet-TTS model.
!sed -n 's/Translated text://p' /content/translated.txt | tr '[:lower:]' '[:upper:]' | tee /content/translated_sed.txt
!../../../utils/synth_wav.sh /content/translated_sed.txt
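The sed/tr pipeline above can also be written in Python. Below is a minimal sketch of the same string handling; the helper name and the sample log content are ours, not part of the `translate_wav.sh` output contract:

```python
def extract_translation(log_text):
    """Pull the text after 'Translated text:' and uppercase it,
    mirroring the sed/tr pipeline above."""
    prefix = "Translated text:"
    for line in log_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip().upper()
    return None

# Illustrative log content for the self-check below.
sample_log = "Decoding started\nTranslated text: yes i'm jose\nDecoding done"
print(extract_translation(sample_log))  # YES I'M JOSE
```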
import matplotlib.pyplot as plt
import kaldiio
fbank = next(iter(kaldiio.load_scp("decode/translated_sed/outputs/feats.scp").values()))
plt.matshow(fbank.T)
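The plotted matrix has shape (frames, mel bins), so the synthesized utterance length can be estimated from the frame count. The frame shift below is an assumption (ESPnet TTS recipes commonly use a 256-sample hop at 22,050 Hz; check the recipe config for the actual value):

```python
import numpy as np

# Stand-in for the loaded feature matrix: 300 frames x 80 mel bins.
fbank = np.zeros((300, 80))

n_frames, n_mels = fbank.shape
frame_shift_sec = 256 / 22050  # assumed hop size, not read from the recipe
print(f"{n_frames} frames x {n_mels} mel bins "
      f"~= {n_frames * frame_shift_sec:.2f} seconds")
```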
from IPython.display import display, Audio
display(Audio("decode/translated_sed/wav_wnv/translated_sed_gen.wav"))
It successfully says "Yes I'm Jose"! For more TTS demos, visit https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb
Check decoding log
After the translation, you will find <decode_dir>/<wav name>/result.json with the decoding details:
!cat decode/st_test/result.json
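result.json can also be inspected programmatically. The schema varies across ESPnet versions, so the structure used in this sketch is illustrative rather than authoritative:

```python
import json

# Illustrative result structure; the real keys depend on the ESPnet version.
raw = """{"utts": {"st_test": {"output": [{"rec_text": "yes i'm jose"}]}}}"""
result = json.loads(raw)
for name, utt in result["utts"].items():
    print(name, "->", utt["output"][0]["rec_text"])
```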
and <decode_dir>/<wav name>/log/decode.log with the runtime log:
!cat decode/st_test/log/decode.log
Let's calculate the real-time factor (RTF) of the ST decoding from decode.log:
from dateutil import parser
from subprocess import PIPE, run
# calc input duration (seconds)
input_sec = float(run(["soxi", "-D", "/content/espnet/test_utils/st_test.wav"], stdout=PIPE).stdout)
# calc NN decoding time
with open("decode/st_test/log/decode.log", "r") as f:
    times = [parser.parse(x.split("(")[0]) for x in f if "e2e_st_transformer" in x]
decode_sec = (times[-1] - times[0]).total_seconds()
# get real-time factor (RTF)
print("Input duration:\t", input_sec, "sec")
print("NN decoding:\t", decode_sec, "sec")
print("Real-time factor:\t", decode_sec / input_sec)
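The input duration can also be measured without sox, using Python's standard wave module. This is a minimal sketch for PCM WAV files; the temporary file path is ours:

```python
import wave

def wav_duration(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Self-check: write one second of 16 kHz, 16-bit mono silence and measure it.
with wave.open("/tmp/silence.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
print(wav_duration("/tmp/silence.wav"))  # 1.0
```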
As you can see above, ESPnet-ST translates speech faster than real time (RTF < 1.0).
Training ST models from scratch
We provide Kaldi-style recipes for ST, as well as for ASR and TTS, as an all-in-one bash script run.sh:
!cd /content/espnet/egs/must_c/st1/ && ./run.sh --must-c /content
However, downloading the dataset takes too long for this demo, so we cancel the cell above.
Details of ESPnet tools
!../../../utils/translate_wav.sh --help
!../../../utils/synth_wav.sh --help