espnet_onnx demonstration
This notebook demonstrates how to export your trained model to the ONNX format. Currently, only ASR is supported.
see also:
- ESPnet: https://github.com/espnet/espnet
- espnet_onnx: https://github.com/Masao-Someki/espnet_onnx
Author: Masao Someki
Table of Contents
- Install Dependency
- Export your model
- Inference with onnx
- Using streaming model
Install Dependency
To run this demo, you need to install the following packages:
- espnet_onnx
- torch >= 1.11.0 (already installed in Colab)
- espnet
- espnet_model_zoo
- onnx

torch, espnet, espnet_model_zoo, and onnx are required only to run the export demo.
!pip install -U espnet_onnx espnet espnet_model_zoo onnx
# in this demo, we need to update scipy to avoid an error
!pip install -U scipy
Export your model
Export model from espnet_model_zoo
The easiest way to export a model is to use espnet_model_zoo. You can download, unpack, and export a pretrained model with the export_from_pretrained method. espnet_onnx saves the ONNX models into its cache directory, which is ${HOME}/.cache/espnet_onnx by default.
# export the model.
from espnet_onnx.export import ModelExport
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
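If you want to check what was written, a minimal sketch that lists the contents of the default cache directory is shown below; the exact file layout under each tag directory is an assumption and may differ between espnet_onnx versions.
# List whatever espnet_onnx wrote under its default cache directory.
# NOTE: the per-tag file layout is an assumption and may vary by version.
from pathlib import Path
cache_dir = Path.home() / ".cache" / "espnet_onnx"
for path in sorted(cache_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(cache_dir))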
Export from custom model
espnet_onnx can also export your own trained model with the export method.
The following script shows how to export from an espnet2.bin.asr_inference.Speech2Text instance. You can also export from a zipped model file by using the export_from_zip function (see the sketch after the export code below).
For this demonstration, I'm using the from_pretrained method to load parameters, but you can load your own model.
# prepare the espnet2.bin.asr_inference.Speech2Text instance.
from espnet2.bin.asr_inference import Speech2Text
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
speech2text = Speech2Text.from_pretrained(tag_name)
# export model
from espnet_onnx.export import ModelExport
sample_model_tag = 'demo/sample_model_1'
m = ModelExport()
m.export(
    speech2text,
    sample_model_tag,
    quantize=False
)
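If your trained model is packed as a zip archive, the export_from_zip function mentioned above can be used instead. The following is a minimal sketch; the argument names and the archive path are assumptions, so check the espnet_onnx documentation for the exact signature.
# Export from a zipped ESPnet model instead of a Speech2Text instance.
# NOTE: argument names and the path below are assumptions for illustration.
from espnet_onnx.export import ModelExport
m = ModelExport()
m.export_from_zip(
    'path/to/your/packed_model.zip',  # hypothetical path to your zipped model
    tag_name='demo/sample_model_2',   # tag used to refer to the exported model later
    quantize=False
)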
Inference with onnx
Now, let's use the exported models for inference.
# please provide the tag_name to specify the exported model.
tag_name = 'kamo-naoyuki/timit_asr_train_asr_raw_word_valid.acc.ave'
# upload a wav file and run inference!
import librosa
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
# Use the exported onnx model for inference.
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name)
nbest = speech2text(y)
print(nbest[0][0])
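speech2text returns an n-best list of hypotheses, so nbest[0][0] above is the text of the best hypothesis. If you want to look at the other hypotheses, a small loop like the one below should work; it assumes each entry follows the espnet2-style tuple layout whose first element is the decoded text.
# Print the decoded text of every returned hypothesis.
# NOTE: assumes each entry is an espnet2-style tuple with the text as its first element.
for rank, hyp in enumerate(nbest, start=1):
    print(f"hypothesis {rank}: {hyp[0]}")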
Using streaming model
Exporting a streaming model is exactly the same as for a non-streaming model; you can follow the #Export your model chapter.
For streaming inference, you can additionally specify the following configuration (see the construction sketch after this list). Usually, these values should be the same as in the training configuration.
- block_size
- hop_size
- look_ahead
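Assuming these options are accepted as keyword arguments when constructing StreamingSpeech2Text (the exact parameter names below are an assumption; check the espnet_onnx documentation), a configured instance might look like this once the model has been exported.
# Construct a streaming model with explicit streaming parameters.
# NOTE: keyword argument names are assumptions; match the values to your training config.
from espnet_onnx import StreamingSpeech2Text
tag_name = 'tag/of/your_exported_streaming_model'  # hypothetical tag name
streaming_model = StreamingSpeech2Text(
    tag_name,
    block_size=40,
    hop_size=16,
    look_ahead=16,
)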
The length of each speech chunk should be the same as streaming_model.hop_size. This value is calculated as follows:

$$
\begin{align}
h &= \text{hop\_size} \times \text{encoder.subsample} \times \text{stft.hop\_length} \\
\text{padding} &= (\text{stft.n\_fft} \,//\, \text{stft.hop\_length}) \times \text{stft.hop\_length} \\
\text{len(wav)} &= h + \text{padding}
\end{align}
$$
For example, the speech length is 8704 samples with the following configuration (verified by the short sketch after this list).
- block_size = 40
- hop_size = 16
- look_ahead = 16
- encoder.subsample = 4
- stft.n_fft = 512
- stft.hop_length = 128
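As a quick sanity check, plugging these values into the formula above reproduces the 8704 samples mentioned earlier; the short sketch below does the arithmetic.
# Recompute the expected chunk length from the configuration above.
hop_size = 16
encoder_subsample = 4
stft_hop_length = 128
stft_n_fft = 512
h = hop_size * encoder_subsample * stft_hop_length            # 16 * 4 * 128 = 8192
padding = (stft_n_fft // stft_hop_length) * stft_hop_length   # (512 // 128) * 128 = 512
print(h + padding)  # 8704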
Now, let's demonstrate the streaming inference.
# Export the streaming model.
# Note that the following model is very large
from espnet_onnx.export import ModelExport
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
m = ModelExport()
m.export_from_pretrained(tag_name)
# In this tutorial, we will use a recorded wav file to simulate streaming.
import librosa
from espnet_onnx import StreamingSpeech2Text
tag_name = 'D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'
streaming_model = StreamingSpeech2Text(tag_name)
# upload wav file
from google.colab import files
wav_file = files.upload()
y, sr = librosa.load(list(wav_file.keys())[0], sr=16000)
num_process = len(y) // streaming_model.hop_size + 1
print(f"I will split your audio file into {num_process} blocks.")
# simulate streaming.
streaming_model.start()
for i in range(num_process):
    # prepare the wav chunk for this block
    start = i * streaming_model.hop_size
    end = (i + 1) * streaming_model.hop_size
    wav_streaming = y[start : end]

    # apply padding if len(wav_streaming) < streaming_model.hop_size
    wav_streaming = streaming_model.pad(wav_streaming)

    # compute asr
    nbest = streaming_model(wav_streaming)
    print(f'Result at position {i} : {nbest[0][0]}')
final_nbest = streaming_model.end()
print(f'Final result : {final_nbest[0][0]}')