CMU 11492/11692 Spring 2023: Speech Enhancement

In this demonstration, we will show how to run speech enhancement systems in ESPnet.

Main references:
  • ESPnet repository
  • ESPnet documentation
  • ESPnet-SE repo

Author: Siddhant Arora (siddhana@andrew.cmu.edu)

The notebook is adapted from this Colab

❗Important Notes❗

  • We are using Colab to show the demo. However, Colab has some constraints on the total GPU runtime. If you use too much GPU time, you may not be able to use a GPU for some time.

  • There are multiple in-class checkpoints ✅ throughout this tutorial. Your participation points are based on these tasks. Please try your best to follow all the steps! If you encounter issues, please notify the TAs as soon as possible so that we can make an adjustment for you.

  • Please submit PDF files of your completed notebooks to Gradescope. You can print the notebook using File -> Print in the menu bar. You also need to submit the spectrograms and waveforms of the noisy and enhanced audio files to Gradescope.

Contents

Tutorials on the Basic Usage

  1. Install

  2. Speech Enhancement with Pretrained Models

We support various interfaces, e.g. Python API, HuggingFace API, portable speech enhancement scripts for other tasks, etc.

2.1 Single-channel Enhancement (CHiME-4)

2.2 Enhance Your Own Recordings

2.3 Multi-channel Enhancement (CHiME-4)

  3. Speech Separation with Pretrained Models

3.1 Model Selection

3.2 Separate Speech Mixture

  4. Evaluate Separated Speech with the Pretrained ASR Model

Tutorials on the Basic Usage

Install

Unlike the previous assignment, where we installed the full version of ESPnet, here we use a lightweight ESPnet package that is mainly designed for inference. Installing the light version can be much faster than a full installation.

[ ]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
%pip uninstall -y torch
%pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
%pip install -q git+https://github.com/espnet/espnet
%pip install -q espnet_model_zoo
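
Before moving on, it can help to verify that the installation succeeded and that a GPU runtime is available. The following optional cell simply prints the installed versions and checks CUDA availability.

[ ]:
# Optional sanity check: confirm the installed versions and GPU availability.
import torch
import espnet

print("torch:", torch.__version__)              # expected: 1.13.0+cu117
print("CUDA available:", torch.cuda.is_available())
print("espnet:", espnet.__version__)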

Speech Enhancement with Pretrained Models

Single-Channel Enhancement, the CHiME example

Task 1 (✅ Checkpoint 1 (1 point))

Run inference of the pretrained single-channel enhancement model.

[ ]:
# Download one utterance from real noisy speech of CHiME4
!gdown --id 1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS -O /content/M05_440C0213_PED_REAL.wav
import os

import soundfile
from IPython.display import display, Audio
mixwav_mc, sr = soundfile.read("/content/M05_440C0213_PED_REAL.wav")
# mixwav_mc.shape: (num_samples, num_channels)
mixwav_sc = mixwav_mc[:, 4]  # take the 5th channel (index 4) as the single-channel input
display(Audio(mixwav_mc.T, rate=sr))
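
Optionally, you can inspect the loaded recording before enhancement; CHiME-4 audio is sampled at 16 kHz with a 6-channel microphone array, so mixwav_mc should have shape (num_samples, 6).

[ ]:
# Optional: inspect the loaded multi-channel recording.
print("mixwav_mc shape (num_samples, num_channels):", mixwav_mc.shape)
print("sampling rate (Hz):", sr)   # CHiME-4 recordings are 16 kHz
print("mixwav_sc shape:", mixwav_sc.shape)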

Download and load the pretrained Conv-TasNet

[ ]:
!gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/chime_tasnet_singlechannel.zip
!unzip /content/chime_tasnet_singlechannel.zip -d /content/enh_model_sc
[ ]:
# Load the model
# If you encounter error "No module named 'espnet2'", please re-run the 1st Cell. This might be a colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech


separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_sc = SeparateSpeech(
  train_config="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/config.yaml",
  model_file="/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/5epoch.pth",
  # for segment-wise processing of long speech
  normalize_segment_scale=False,
  show_progressbar=True,
  ref_channel=4,
  normalize_output_wav=True,
  device="cuda:0",
)

Enhance the single-channel real noisy speech in CHiME4

Please submit a screenshot of the output of the current block, together with the spectrogram and waveform of the noisy and enhanced speech files, to Gradescope for Task 1.

[ ]:
# play the enhanced single-channel speech
wave = enh_model_sc(mixwav_sc[None, ...], sr)

print("Input real noisy speech", flush=True)
display(Audio(mixwav_sc, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))
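
Task 1 also asks for the spectrogram and waveform of the noisy and enhanced audio, which the notebook only plots later in the separation section. The optional cell below sketches a small helper for this (plot_wave_and_spec is an assumed name, not an ESPnet function): it draws both plots and saves a PNG plus the enhanced WAV for the Gradescope submission.

[ ]:
# Hypothetical helper (not part of ESPnet): plot waveform + spectrogram for a list
# of signals and save the figure so it can be attached to the Gradescope submission.
import numpy as np
import matplotlib.pyplot as plt
import soundfile

def plot_wave_and_spec(signals, names, fs, out_png):
    plt.figure(figsize=(16, 4 * len(signals)))
    for i, (sig, name) in enumerate(zip(signals, names)):
        sig = np.asarray(sig).squeeze()
        t = np.arange(len(sig)) / fs
        plt.subplot(len(signals), 2, 2 * i + 1)
        plt.title(f"{name} waveform")
        plt.plot(t, sig)
        plt.xlabel("Time (s)")
        plt.subplot(len(signals), 2, 2 * i + 2)
        plt.title(f"{name} spectrogram")
        plt.specgram(sig, NFFT=512, Fs=fs, noverlap=384)
        plt.xlabel("Time (s)")
        plt.ylabel("Frequency (Hz)")
    plt.tight_layout()
    plt.savefig(out_png)
    plt.show()

plot_wave_and_spec(
    [mixwav_sc, wave[0]], ["Noisy (CH5)", "Enhanced"], sr, "task1_noisy_vs_enhanced.png"
)
# Also keep the enhanced audio itself, in case you want to attach the WAV file.
soundfile.write("enhanced_singlechannel.wav", wave[0].squeeze(), sr)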

Multi-Channel Enhancement

Download and load the pretrained MVDR neural beamformer.

Task 2 (✅ Checkpoint 2 (1 point))

Run inference of the pretrained multi-channel enhancement model.

[ ]:
# Download the pretrained enhancement model

!gdown --id 1FohDfBlOa7ipc9v2luY-QIFQ_GJ1iW_i -O /content/mvdr_beamformer_16k_se_raw_valid.zip
!unzip /content/mvdr_beamformer_16k_se_raw_valid.zip -d /content/enh_model_mc
[ ]:
# Load the model
# If you encounter error "No module named 'espnet2'", please re-run the 1st Cell. This might be a colab bug.
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech


separate_speech = {}
# For models downloaded from GoogleDrive, you can use the following script:
enh_model_mc = SeparateSpeech(
  train_config="/content/enh_model_mc/exp/enh_train_enh_beamformer_mvdr_raw/config.yaml",
  model_file="/content/enh_model_mc/exp/enh_train_enh_beamformer_mvdr_raw/11epoch.pth",
  # for segment-wise processing of long speech
  normalize_segment_scale=False,
  show_progressbar=True,
  ref_channel=4,
  normalize_output_wav=True,
  device="cuda:0",
)

Enhance the multi-channel real noisy speech in CHiME4

Please submit a screenshot of the output of the current block, together with the spectrogram and waveform of the noisy and enhanced speech files, to Gradescope for Task 2.

[ ]:
wave = enh_model_mc(mixwav_mc[None, ...], sr)
print("Input real noisy speech", flush=True)
display(Audio(mixwav_mc.T, rate=sr))
print("Enhanced speech", flush=True)
display(Audio(wave[0].squeeze(), rate=sr))
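
For the Task 2 submission you can reuse the plot_wave_and_spec helper sketched under Task 1 (again, an assumed helper rather than an ESPnet function), comparing the reference channel of the noisy input with the beamformed output.

[ ]:
# Reuse the (hypothetical) helper from Task 1 for the multi-channel result.
plot_wave_and_spec(
    [mixwav_mc[:, 4], wave[0]], ["Noisy (CH5)", "MVDR enhanced"], sr,
    "task2_noisy_vs_enhanced.png",
)
soundfile.write("enhanced_multichannel.wav", wave[0].squeeze(), sr)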

Portable speech enhancement scripts for other tasks

For an ESPnet ASR or TTS dataset like the one below:

data
`-- et05_real_isolated_6ch_track
    |-- spk2utt
    |-- text
    |-- utt2spk
    |-- utt2uniq
    `-- wav.scp

Run the following script to create an enhanced dataset:

scripts/utils/enhance_dataset.sh \
    --spk_num 1 \
    --gpu_inference true \
    --inference_nj 4 \
    --fs 16k \
    --id_prefix "" \
    dump/raw/et05_real_isolated_6ch_track \
    data/et05_real_isolated_6ch_track_enh \
    exp/enh_train_enh_beamformer_mvdr_raw/valid.loss.best.pth

The above script will generate a new directory data/et05_real_isolated_6ch_track_enh:

data
`-- et05_real_isolated_6ch_track_enh
    |-- spk2utt
    |-- text
    |-- utt2spk
    |-- utt2uniq
    |-- wav.scp
    `-- wavs/

where wav.scp contains paths to the enhanced audio files (stored in wavs/).
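
If you have not worked with Kaldi-style data directories before, wav.scp is a plain text file with one "<utterance-id> <wav-path-or-command>" entry per line. The optional snippet below simply peeks at the first few entries (the path is a placeholder; adjust it to your own data directory).

[ ]:
# Illustrative sketch: inspect the first entries of a Kaldi-style wav.scp.
# The path below is a placeholder; point it at your own data directory.
scp_path = "data/et05_real_isolated_6ch_track_enh/wav.scp"
with open(scp_path) as f:
    for i, line in enumerate(f):
        utt_id, wav_path = line.strip().split(maxsplit=1)
        print(utt_id, "->", wav_path)
        if i >= 4:
            break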

Speech Separation

Model Selection

In this demonstration, we will show speech separation with a pretrained model on wsj0_2mix.

The pretrained models can be downloaded from a direct URL, or from Zenodo and Hugging Face with the corresponding model ID.
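
As an alternative to gdown, models published on Zenodo or Hugging Face can be fetched by name with espnet_model_zoo. The cell below sketches that route; the model tag is a placeholder, so replace it with a real tag from the ESPnet model zoo before uncommenting the last two lines.

[ ]:
# Sketch of the model-zoo route; the tag below is a placeholder, not a real model name.
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.enh_inference import SeparateSpeech

d = ModelDownloader()
# download_and_unpack returns a dict of config/checkpoint paths that can be
# passed straight to SeparateSpeech:
# model_kwargs = d.download_and_unpack("<enhancement-model-tag-or-URL>")
# separate_speech = SeparateSpeech(**model_kwargs, device="cuda:0")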

[ ]:
!gdown --id 1TasZxZSnbSPsk_Wf7ZDhBAigS6zN8G9G -O enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw_valid.loss.ave.zip
!unzip enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw_valid.loss.ave.zip -d /content/enh_model_ss
[ ]:
import sys
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech

# For models downloaded from GoogleDrive, you can use the following script:
separate_speech = SeparateSpeech(
  train_config="/content/enh_model_ss/exp/enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw/config.yaml",
  model_file="/content/enh_model_ss/exp/enh_train_enh_tfgridnet_tf_lr-patience3_patience5_raw/98epoch.pth",
  # for segment-wise processing of long speech (segment_size and hop_size are in seconds)
  segment_size=2.4,
  hop_size=0.8,
  normalize_segment_scale=False,
  show_progressbar=True,
  ref_channel=None,
  normalize_output_wav=True,
  device="cuda:0",
)

Separate Speech Mixture

Separate an example from the wsj0_2mix test set

Task 3 (✅ Checkpoint 3 (1 point))

Run inference of the pretrained speech separation model based on TF-GridNet.

Please submit a screenshot of the output of the current block, together with the spectrogram and waveform of the mixed and separated speech files, to Gradescope for Task 3.

[ ]:
!gdown --id 1ZCUkd_Lb7pO2rpPr4FqYdtJBZ7JMiInx -O /content/447c020t_1.2106_422a0112_-1.2106.wav

import os
import soundfile
from IPython.display import display, Audio

mixwav, sr = soundfile.read("447c020t_1.2106_422a0112_-1.2106.wav")
waves_wsj = separate_speech(mixwav[None, ...], fs=sr)

print("Input mixture", flush=True)
display(Audio(mixwav, rate=sr))
print(f"========= Separated speech with model =========", flush=True)
print("Separated spk1", flush=True)
display(Audio(waves_wsj[0].squeeze(), rate=sr))
print("Separated spk2", flush=True)
display(Audio(waves_wsj[1].squeeze(), rate=sr))

Show spectrograms of separated speech

Show the waveforms and spectrograms of the mixed and separated speech.

[ ]:
import matplotlib.pyplot as plt
import torch
from torch_complex.tensor import ComplexTensor

from espnet.asr.asr_utils import plot_spectrogram
from espnet2.layers.stft import Stft


stft = Stft(
  n_fft=512,
  win_length=None,
  hop_length=128,
  window="hann",
)
ilens = torch.LongTensor([len(mixwav)])
# specs: (T, F)
spec_mix = ComplexTensor(
    *torch.unbind(
      stft(torch.as_tensor(mixwav).unsqueeze(0), ilens)[0].squeeze(),
      dim=-1
  )
)
spec_sep1 = ComplexTensor(
    *torch.unbind(
      stft(torch.as_tensor(waves_wsj[0]), ilens)[0].squeeze(),
      dim=-1
  )
)
spec_sep2 = ComplexTensor(
    *torch.unbind(
      stft(torch.as_tensor(waves_wsj[1]), ilens)[0].squeeze(),
      dim=-1
  )
)

samples = torch.linspace(0, len(mixwav) / sr, len(mixwav))
plt.figure(figsize=(24, 12))
plt.subplot(3, 2, 1)
plt.title('Mixture Spectrogram')
plot_spectrogram(
  plt, abs(spec_mix).transpose(-1, -2).numpy(), fs=sr,
  mode='db', frame_shift=None,
  bottom=False, labelbottom=False
)
plt.subplot(3, 2, 2)
plt.title('Mixture Waveform')
plt.plot(samples, mixwav)
plt.xlim(0, len(mixwav) / sr)

plt.subplot(3, 2, 3)
plt.title('Separated Spectrogram (spk1)')
plot_spectrogram(
  plt, abs(spec_sep1).transpose(-1, -2).numpy(), fs=sr,
  mode='db', frame_shift=None,
  bottom=False, labelbottom=False
)
plt.subplot(3, 2, 4)
plt.title('Separated Waveform (spk1)')
plt.plot(samples, waves_wsj[0].squeeze())
plt.xlim(0, len(mixwav) / sr)

plt.subplot(3, 2, 5)
plt.title('Separated Spectrogram (spk2)')
plot_spectrogram(
  plt, abs(spec_sep2).transpose(-1, -2).numpy(), fs=sr,
  mode='db', frame_shift=None,
  bottom=False, labelbottom=False
)
plt.subplot(3, 2, 6)
plt.title('Separated Waveform (spk2)')
plt.plot(samples, waves_wsj[1].squeeze())
plt.xlim(0, len(mixwav) / sr)
plt.xlabel("Time (s)")
plt.show()

Evaluate separated speech with pretrained ASR model

The ground truths are:

text_1: SOME CRITICS INCLUDING HIGH REAGAN ADMINISTRATION OFFICIALS ARE RAISING THE ALARM THAT THE FED'S POLICY IS TOO TIGHT AND COULD CAUSE A RECESSION NEXT YEAR

text_2: THE UNITED STATES UNDERTOOK TO DEFEND WESTERN EUROPE AGAINST SOVIET ATTACK

(The speech recognition may take a while.)

[ ]:
%pip install -q https://github.com/kpu/kenlm/archive/master.zip # ASR needs kenlm

Task 4 (✅ Checkpoint 4 (1 point))

Show inference of the pretrained ASR model on the mixed and separated speech.

[ ]:
!gdown --id 1H7--jXTTwmwxzfO8LT5kjZyBjng-HxED -O asr_train_asr_transformer_raw_char_1gpu_valid.acc.ave.zip
!unzip asr_train_asr_transformer_raw_char_1gpu_valid.acc.ave.zip -d /content/asr_model
!ln -sf /content/asr_model/exp .

Please submit a screenshot of the ASR output for the Mix Speech, Separated Speech 1, and Separated Speech 2 to Gradescope for Task 4.

[ ]:
import espnet_model_zoo
from espnet2.bin.asr_inference import Speech2Text


# For models downloaded from GoogleDrive, you can use the following script:
speech2text = Speech2Text(
  asr_train_config="/content/asr_model/exp/asr_train_asr_transformer_raw_char_1gpu/config.yaml",
  asr_model_file="/content/asr_model/exp/asr_train_asr_transformer_raw_char_1gpu/valid.acc.ave_10best.pth",
  device="cuda:0"
)

text_est = [None, None]
text_est[0], *_ = speech2text(waves_wsj[0].squeeze())[0]
text_est[1], *_ = speech2text(waves_wsj[1].squeeze())[0]
text_m, *_ = speech2text(mixwav)[0]
print("Mix Speech to Text: ", text_m)
print("Separated Speech 1 to Text: ", text_est[0])
print("Separated Speech 2 to Text: ", text_est[1])
[ ]:
import difflib
from itertools import permutations

import editdistance
import numpy as np

colors = dict(
    red=lambda text: f"\033[38;2;255;0;0m{text}\033[0m" if text else "",
    green=lambda text: f"\033[38;2;0;255;0m{text}\033[0m" if text else "",
    yellow=lambda text: f"\033[38;2;225;225;0m{text}\033[0m" if text else "",
    white=lambda text: f"\033[38;2;255;255;255m{text}\033[0m" if text else "",
    black=lambda text: f"\033[38;2;0;0;0m{text}\033[0m" if text else "",
)

def diff_strings(ref, est):
    """Reference: https://stackoverflow.com/a/64404008/7384873"""
    ref_str, est_str, err_str = [], [], []
    matcher = difflib.SequenceMatcher(None, ref, est)
    for opcode, a0, a1, b0, b1 in matcher.get_opcodes():
        if opcode == "equal":
            txt = ref[a0:a1]
            ref_str.append(txt)
            est_str.append(txt)
            err_str.append(" " * (a1 - a0))
        elif opcode == "insert":
            ref_str.append("*" * (b1 - b0))
            est_str.append(colors["green"](est[b0:b1]))
            err_str.append(colors["black"]("I" * (b1 - b0)))
        elif opcode == "delete":
            ref_str.append(ref[a0:a1])
            est_str.append(colors["red"]("*" * (a1 - a0)))
            err_str.append(colors["black"]("D" * (a1 - a0)))
        elif opcode == "replace":
            diff = a1 - a0 - b1 + b0
            if diff >= 0:
                txt_ref = ref[a0:a1]
                txt_est = colors["yellow"](est[b0:b1]) + colors["red"]("*" * diff)
                txt_err = "S" * (b1 - b0) + "D" * diff
            elif diff < 0:
                txt_ref = ref[a0:a1] + "*" * -diff
                txt_est = colors["yellow"](est[b0:b1]) + colors["green"]("*" * -diff)
                txt_err = "S" * (b1 - b0) + "I" * -diff

            ref_str.append(txt_ref)
            est_str.append(txt_est)
            err_str.append(colors["black"](txt_err))
    return "".join(ref_str), "".join(est_str), "".join(err_str)


text_ref = [
  "SOME CRITICS INCLUDING HIGH REAGAN ADMINISTRATION OFFICIALS ARE RAISING THE ALARM THAT THE FED'S POLICY IS TOO TIGHT AND COULD CAUSE A RECESSION NEXT YEAR",
  "THE UNITED STATES UNDERTOOK TO DEFEND WESTERN EUROPE AGAINST SOVIET ATTACK",
]

print("=====================" , flush=True)
perms = list(permutations(range(2)))
string_edit = [
  [
    editdistance.eval(text_ref[m], text_est[n])
    for m, n in enumerate(p)
  ]
  for p in perms
]

dist = [sum(edist) for edist in string_edit]
perm_idx = np.argmin(dist)
perm = perms[perm_idx]

for i, p in enumerate(perm):
  print("\n--------------- Text %d ---------------" % (i + 1), flush=True)
  ref, est, err = diff_strings(text_ref[i], text_est[p])
  print("REF: " + ref + "\n" + "HYP: " + est + "\n" + "ERR: " + err, flush=True)
  print("Edit Distance = {}\n".format(string_edit[perm_idx][i]), flush=True)

Task 5 (✅ Checkpoint 5 (1 point))

Enhance your own recordings. Your input speech can be recorded by yourself, or you can find it from other sources (e.g., YouTube).

Discuss whether the input speech was clearly denoised, and if not, what a potential reason could be.

[YOUR ANSWER HERE]

Please submit the spectrogram and waveform of your input and enhanced speech to Gradescope for Task 5, along with a screenshot of your answer.

[ ]:
from google.colab import files
from IPython.display import display, Audio
import soundfile
fs = 16000
uploaded = files.upload()

for file_name in uploaded.keys():
  speech, rate = soundfile.read(file_name)
  assert rate == fs, "mismatch in sampling rate"
  wave = enh_model_sc(speech[None, ...], fs)
  print(f"Your input speech {file_name}", flush=True)
  display(Audio(speech, rate=fs))
  print(f"Enhanced speech for {file_name}", flush=True)
  display(Audio(wave[0].squeeze(), rate=fs))
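
The cell above assumes your recording is already sampled at 16 kHz, and the assert will fail otherwise. If your file has a different rate or is stereo, one possible workaround is to convert it first, e.g. with torchaudio (installed in the first cell); the following is a sketch of that approach.

[ ]:
# Optional workaround if your recording is not 16 kHz mono: convert before enhancement.
import torch
import torchaudio

for file_name in uploaded.keys():
  speech, rate = soundfile.read(file_name)
  if speech.ndim > 1:
    speech = speech.mean(axis=1)          # mix down to mono
  if rate != fs:
    speech = torchaudio.functional.resample(
        torch.as_tensor(speech, dtype=torch.float32), orig_freq=rate, new_freq=fs
    ).numpy()
  wave = enh_model_sc(speech[None, ...], fs)
  print(f"Enhanced speech for {file_name} (converted to {fs} Hz mono)", flush=True)
  display(Audio(wave[0].squeeze(), rate=fs))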