ESPnet Speech Translation Demonstration
See also
- ESPnet: https://github.com/espnet/espnet
- ESPnet documentation: https://espnet.github.io/espnet/
- TTS demo: https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb
Author: Shigeki Karita
Install
Installation takes around 3 minutes. Please wait for it to finish.
# OS setup
!cat /etc/os-release
!apt-get install -qq bc tree sox
# espnet and moses setup
!git clone -q https://github.com/ShigekiKarita/espnet.git
!pip install -q torch==1.1
!cd espnet; git checkout c0466d9a356c1a33f671a546426d7bc33b5b17e8; pip install -q -e .
!cd espnet/tools/; make moses.done
# download pre-compiled warp-ctc and kaldi tools
!espnet/utils/download_from_google_drive.sh \
"https://drive.google.com/open?id=13Y4tSygc8WtqzvAVGK_vRV9GlV7TRC0w" espnet/tools tar.gz > /dev/null
# make dummy activate
!mkdir -p espnet/tools/venv/bin && touch espnet/tools/venv/bin/activate
!echo "setup done."
Spanish speech -> English text translation
This audio says "yo soy José" ("I am José").
from IPython.display import display, Audio
display(Audio("/content/espnet/test_utils/st_test.wav", rate=16000))
Let's translate this audio into English text with a pretrained Transformer ST model trained on the Fisher-CALLHOME Spanish corpus.
# move to the recipe directory
import os
os.chdir("/content/espnet/egs/fisher_callhome_spanish/st1")
!../../../utils/translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en ../../../test_utils/st_test.wav | tee /content/translated.txt
As shown above, we successfully obtained the result "Translated text: yes i'm jose"!
English translated text-to-speech synthesis
Now let's generate English speech from the translated text using a pretrained ESPnet-TTS model.
!sed -n 's/Translated text://p' /content/translated.txt | tr '[:lower:]' '[:upper:]' | tee /content/translated_sed.txt
!../../../utils/synth_wav.sh /content/translated_sed.txt
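The sed/tr pipeline above can also be written in Python. Below is a minimal sketch of the same string handling; the helper name and the sample log content are ours, not part of the `translate_wav.sh` output contract:

```python
def extract_translation(log_text):
    """Pull the text after 'Translated text:' and uppercase it,
    mirroring the sed/tr pipeline above."""
    prefix = "Translated text:"
    for line in log_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip().upper()
    return None

# Illustrative log content for the self-check below.
sample_log = "Decoding started\nTranslated text: yes i'm jose\nDecoding done"
print(extract_translation(sample_log))  # YES I'M JOSE
```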
import matplotlib.pyplot as plt
import kaldiio
fbank = next(iter(kaldiio.load_scp("decode/translated_sed/outputs/feats.scp").values()))
plt.matshow(fbank.T)
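The plotted matrix has shape (frames, mel bins), so the synthesized utterance length can be estimated from the frame count. The frame shift below is an assumption (ESPnet TTS recipes commonly use a 256-sample hop at 22,050 Hz; check the recipe config for the actual value):

```python
import numpy as np

# Stand-in for the loaded feature matrix: 300 frames x 80 mel bins.
fbank = np.zeros((300, 80))

n_frames, n_mels = fbank.shape
frame_shift_sec = 256 / 22050  # assumed hop size, not read from the recipe
print(f"{n_frames} frames x {n_mels} mel bins "
      f"~= {n_frames * frame_shift_sec:.2f} seconds")
```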
from IPython.display import display, Audio
display(Audio("decode/translated_sed/wav_wnv/translated_sed_gen.wav"))
It successfully says "Yes I'm Jose"! For more TTS demos, visit https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb
Check decoding log
After the translation, you will find <decode_dir>/<wav name>/result.json with the decoding details:
!cat decode/st_test/result.json
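result.json can also be inspected programmatically. The schema varies across ESPnet versions, so the structure used in this sketch is illustrative rather than authoritative:

```python
import json

# Illustrative result structure; the real keys depend on the ESPnet version.
raw = """{"utts": {"st_test": {"output": [{"rec_text": "yes i'm jose"}]}}}"""
result = json.loads(raw)
for name, utt in result["utts"].items():
    print(name, "->", utt["output"][0]["rec_text"])
```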
and <decode_dir>/<wav name>/log/decode.log with the runtime log:
!cat decode/st_test/log/decode.log
Let's calculate the real-time factor (RTF) of the ST decoding from decode.log:
from dateutil import parser
from subprocess import PIPE, run
# calc input duration (seconds)
input_sec = float(run(["soxi", "-D", "/content/espnet/test_utils/st_test.wav"], stdout=PIPE).stdout)
# calc NN decoding time
with open("decode/st_test/log/decode.log", "r") as f:
    times = [parser.parse(x.split("(")[0]) for x in f if "e2e_st_transformer" in x]
decode_sec = (times[-1] - times[0]).total_seconds()
# get real-time factor (RTF)
print("Input duration:\t", input_sec, "sec")
print("NN decoding:\t", decode_sec, "sec")
print("Real-time factor:\t", decode_sec / input_sec)
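The input duration can also be measured without sox, using Python's standard wave module. This is a minimal sketch for PCM WAV files; the temporary file path is ours:

```python
import wave

def wav_duration(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Self-check: write one second of 16 kHz, 16-bit mono silence and measure it.
with wave.open("/tmp/silence.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
print(wav_duration("/tmp/silence.wav"))  # 1.0
```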
As you can see above, ESPnet-ST translates speech faster than real time (RTF < 1.0).
Training ST models from scratch
We provide Kaldi-style recipes for ST, as well as for ASR and TTS, as an all-in-one bash script run.sh:
!cd /content/espnet/egs/must_c/st1/ && ./run.sh --must-c /content
However, downloading the dataset takes too long for this demo, so we cancel the cell above.
Details of ESPnet tools
!../../../utils/translate_wav.sh --help
!../../../utils/synth_wav.sh --help