espnet_onnx demonstration
This notebook demonstrates how to export your trained model to the ONNX format. Currently, only ASR is supported.
see also:
- ESPnet: https://github.com/espnet/espnet
- espnet_onnx: https://github.com/Masao-Someki/espnet_onnx
Author: Masao Someki
Table of Contents
- Install Dependency
- Export your model
- Inference with onnx
- Using streaming model
Install Dependency
To run this demo, you need to install the following packages:
- espnet_onnx
- torch >= 1.11.0 (already installed in Colab)
- espnet
- espnet_model_zoo
- onnx

torch, espnet, espnet_model_zoo, and onnx are required only to run the export demo.
!pip install -U espnet_onnx espnet espnet_model_zoo onnx
# in this demo, we need to update scipy to avoid an error
!pip install -U scipy
Export your model
Export model from espnet_model_zoo
The easiest way to export a model is to use espnet_model_zoo. You can download, unpack, and export a pretrained model with the export_from_pretrained method. espnet_onnx saves the ONNX models into its cache directory, which is ${HOME}/.cache/espnet_onnx by default.
# export the model.
from espnet_onnx.export import ModelExport
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
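If you want to check what was written, a minimal sketch that lists the contents of the default cache directory is shown below; the exact file layout under each tag directory is an assumption and may differ between espnet_onnx versions.
# List whatever espnet_onnx wrote under its default cache directory.
# NOTE: the per-tag file layout is an assumption and may vary by version.
from pathlib import Path
cache_dir = Path.home() / ".cache" / "espnet_onnx"
for path in sorted(cache_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(cache_dir))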
Export from custom model
espnet_onnx can also export your own trained model with the export method.
The following script shows how to export from an espnet2.bin.asr_inference.Speech2Text instance. You can also export from a zipped model file by using the export_from_zip function (see the sketch after the export code below).
For this demonstration, I'm using the from_pretrained method to load parameters, but you can load your own model.
# prepare the espnet2.bin.asr_inference.Speech2Text instance.
from espnet2.bin.asr_inference import Speech2Text
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
speech2text = Speech2Text.from_pretrained(tag_name)
# export model
from espnet_onnx.export import ModelExport
sample_model_tag = 'demo/sample_model_1'
m = ModelExport()
m.export(
    speech2text,
    sample_model_tag,
    quantize=False
)
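If your trained model is packed as a zip archive, the export_from_zip function mentioned above can be used instead. The following is a minimal sketch; the argument names and the archive path are assumptions, so check the espnet_onnx documentation for the exact signature.
# Export from a zipped ESPnet model instead of a Speech2Text instance.
# NOTE: argument names and the path below are assumptions for illustration.
from espnet_onnx.export import ModelExport
m = ModelExport()
m.export_from_zip(
    'path/to/your/packed_model.zip',  # hypothetical path to your zipped model
    tag_name='demo/sample_model_2',   # tag used to refer to the exported model later
    quantize=False
)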
Inference with onnx
Now, let's use the exported models for inference.
# please provide the tag_name to specify the exported model.
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
# upload a wav file and run inference!
import librosa
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
# Use the exported onnx model for inference.
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name)
nbest = speech2text(y)
print(nbest[0][0])
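speech2text returns an n-best list of hypotheses, so nbest[0][0] above is the text of the best hypothesis. If you want to look at the other hypotheses, a small loop like the one below should work; it assumes each entry follows the espnet2-style tuple layout whose first element is the decoded text.
# Print the decoded text of every returned hypothesis.
# NOTE: assumes each entry is an espnet2-style tuple with the text as its first element.
for rank, hyp in enumerate(nbest, start=1):
    print(f"hypothesis {rank}: {hyp[0]}")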
Using streaming model
Exporting a streaming model is exactly the same as for a non-streaming model; you can follow the #Export your model chapter.
For streaming inference, you can additionally specify the following configuration (see the construction sketch after this list). Usually, these values should be the same as in the training configuration.
- block_size
- hop_size
- look_ahead
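Assuming these options are accepted as keyword arguments when constructing StreamingSpeech2Text (the exact parameter names below are an assumption; check the espnet_onnx documentation), a configured instance might look like this once the model has been exported.
# Construct a streaming model with explicit streaming parameters.
# NOTE: keyword argument names are assumptions; match the values to your training config.
from espnet_onnx import StreamingSpeech2Text
tag_name = 'tag/of/your_exported_streaming_model'  # hypothetical tag name
streaming_model = StreamingSpeech2Text(
    tag_name,
    block_size=40,
    hop_size=16,
    look_ahead=16,
)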
The length of each speech chunk should be the same as streaming_model.hop_size. This value is calculated as follows:

$$
\begin{align}
h &= \text{hop\_size} \times \text{encoder.subsample} \times \text{stft.hop\_length} \\
\text{padding} &= (\text{stft.n\_fft} \,//\, \text{stft.hop\_length}) \times \text{stft.hop\_length} \\
\text{len(wav)} &= h + \text{padding}
\end{align}
$$
For example, the speech length is 8704 samples with the following configuration (verified by the short sketch after this list).
- block_size = 40
- hop_size = 16
- look_ahead = 16
- encoder.subsample = 4
- stft.n_fft = 512
- stft.hop_length = 128
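As a quick sanity check, plugging these values into the formula above reproduces the 8704 samples mentioned earlier; the short sketch below does the arithmetic.
# Recompute the expected chunk length from the configuration above.
hop_size = 16
encoder_subsample = 4
stft_hop_length = 128
stft_n_fft = 512
h = hop_size * encoder_subsample * stft_hop_length            # 16 * 4 * 128 = 8192
padding = (stft_n_fft // stft_hop_length) * stft_hop_length   # (512 // 128) * 128 = 512
print(h + padding)  # 8704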
Now, let's demonstrate the streaming inference.
# Export the streaming model.
# Note that the following model is very large
from espnet_onnx.export import ModelExport
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
# In this tutorial, we will use a recorded wav file to simulate streaming.
import librosa
from espnet_onnx import StreamingSpeech2Text
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
streaming_model = StreamingSpeech2Text(tag_name)
# upload wav file
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
num_process = len(y) // streaming_model.hop_size + 1
print(f"I will split your audio file into {num_process} blocks.")
# simulate streaming.
streaming_model.start()
for i in range(num_process):
    # prepare the wav chunk for this block
    start = i * streaming_model.hop_size
    end = (i + 1) * streaming_model.hop_size
    wav_streaming = y[start : end]

    # apply padding if len(wav_streaming) < streaming_model.hop_size
    wav_streaming = streaming_model.pad(wav_streaming)

    # compute asr
    nbest = streaming_model(wav_streaming)
    print(f'Result at position {i} : {nbest[0][0]}')
final_nbest = streaming_model.end()
print(f'Final result : {final_nbest[0][0]}')