{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ESPnet2 real streaming Transformer demonstration\n",
    "Details in \"Streaming Transformer ASR with Blockwise Synchronous Beam Search\"\n",
    "(https://arxiv.org/abs/2006.14941)\n",
    "\n",
    "This local notebook provides a demonstration of streaming ASR based on Transformer using ESPnet2.\n",
    "\n",
    "You can recognize a recorded audio file or a speech online.\n",
    "\n",
    "Author: Keqi Deng (UCAS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a streaming Transformer model\n",
    "You can train a streaming Transformer model on your own corpus following the example of https://github.com/espnet/espnet/blob/master/egs2/aishell/asr1/run_streaming.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download pre-trained model and audio file for demo\n",
    "You can download the pre-trained model from the ESPnet_model_zoo or directly from Huggingface.\n",
    "### For Mandarin Task (Pretrained using AISHELL-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tag='Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For English Task (Pretrained using Tedlium2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tag='D-Keqi/espnet_asr_train_asr_streaming_transformer_raw_en_bpe500_sp_valid.acc.ave'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import packages\n",
    "Make sure that you have installed the latest ESPnet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import espnet\n",
    "from espnet2.bin.asr_inference_streaming import Speech2TextStreaming\n",
    "from espnet_model_zoo.downloader import ModelDownloader\n",
    "import argparse\n",
    "import numpy as np\n",
    "import wave"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare for inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d=ModelDownloader()\n",
    "speech2text = Speech2TextStreaming(\n",
    "    **d.download_and_unpack(tag),\n",
    "    token_type=None,\n",
    "    bpemodel=None,\n",
    "    maxlenratio=0.0,\n",
    "    minlenratio=0.0,\n",
    "    beam_size=20,\n",
    "    ctc_weight=0.5,\n",
    "    lm_weight=0.0,\n",
    "    penalty=0.0,\n",
    "    nbest=1,\n",
    "    device = \"cpu\",\n",
    "    disable_repetition_detection=True,\n",
    "    decoder_text_length_limit=0,\n",
    "    encoded_feat_length_limit=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prev_lines = 0\n",
    "def progress_output(text):\n",
    "    global prev_lines\n",
    "    lines=['']\n",
    "    for i in text:\n",
    "        if len(lines[-1]) > 100:\n",
    "            lines.append('')\n",
    "        lines[-1] += i\n",
    "    for i,line in enumerate(lines):\n",
    "        if i == prev_lines:\n",
    "            sys.stderr.write('\\n\\r')\n",
    "        else:\n",
    "            sys.stderr.write('\\r\\033[B\\033[K')\n",
    "        sys.stderr.write(line)\n",
    "\n",
    "    prev_lines = len(lines)\n",
    "    sys.stderr.flush()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(wavfile):\n",
    "    with wave.open(wavfile, 'rb') as wavfile:\n",
    "        ch=wavfile.getnchannels()\n",
    "        bits=wavfile.getsampwidth()\n",
    "        rate=wavfile.getframerate()\n",
    "        nframes=wavfile.getnframes()\n",
    "        buf = wavfile.readframes(-1)\n",
    "        data=np.frombuffer(buf, dtype='int16')\n",
    "    speech = data.astype(np.float16)/32767.0 #32767 is the upper limit of 16-bit binary numbers and is used for the normalization of int to float.\n",
    "    sim_chunk_length = 640\n",
    "    if sim_chunk_length > 0:\n",
    "        for i in range(len(speech)//sim_chunk_length):\n",
    "            results = speech2text(speech=speech[i*sim_chunk_length:(i+1)*sim_chunk_length], is_final=False)\n",
    "            if results is not None and len(results) > 0:\n",
    "                nbests = [text for text, token, token_int, hyp in results]\n",
    "                text = nbests[0] if nbests is not None and len(nbests) > 0 else \"\"\n",
    "                progress_output(nbests[0])\n",
    "            else:\n",
    "                progress_output(\"\")\n",
    "            \n",
    "        results = speech2text(speech[(i+1)*sim_chunk_length:len(speech)], is_final=True)\n",
    "    else:\n",
    "        results = speech2text(speech, is_final=True)\n",
    "    nbests = [text for text, token, token_int, hyp in results]\n",
    "    progress_output(nbests[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recognize the audio file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#You can upload your own audio file for recognition, and also we provide some demo audio files that you can download from Google drive. \n",
    "#For Mandarin task, the demo file comes from the AISSHELL-1: https://drive.google.com/file/d/1l8w93r8Bs5FtC3A-1ydEqFQdP4k6FiUL/view?usp=sharing\n",
    "#wavfile='./BAC009S0724W0121.wav'\n",
    "#For  English task, the demo file comes from the Librispeech: https://drive.google.com/file/d/1l71ZUNQ6qQk95T54H0tH_OEwZvWnEL4u/view?usp=sharing\n",
    "#wavfile='./61-70968-0000.wav'\n",
    "recognize(wavfile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recognize the speech from speaker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install pyaudio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyaudio"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streamingly recognize with pyaudio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CHUNK=2048\n",
    "FORMAT=pyaudio.paInt16\n",
    "CHANNELS=1\n",
    "RATE=16000\n",
    "RECORD_SECONDS=5\n",
    "p=pyaudio.PyAudio()\n",
    "stream = p.open(format=FORMAT,channels=CHANNELS,rate=RATE,input=True,frames_per_buffer=CHUNK)\n",
    "for i in range(0,int(RATE/CHUNK*RECORD_SECONDS)+1):\n",
    "    data=stream.read(CHUNK)\n",
    "    data=np.frombuffer(data, dtype='int16')\n",
    "    data=data.astype(np.float16)/32767.0 #32767 is the upper limit of 16-bit binary numbers and is used for the normalization of int to float.\n",
    "    if i==int(RATE/CHUNK*RECORD_SECONDS):\n",
    "        results = speech2text(speech=data, is_final=True)\n",
    "        break\n",
    "    results = speech2text(speech=data, is_final=False)\n",
    "    if results is not None and len(results) > 0:\n",
    "        nbests = [text for text, token, token_int, hyp in results]\n",
    "        text = nbests[0] if nbests is not None and len(nbests) > 0 else \"\"\n",
    "        progress_output(nbests[0])\n",
    "    else:\n",
    "        progress_output(\"\")\n",
    "nbests = [text for text, token, token_int, hyp in results]\n",
    "progress_output(nbests[0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}