espnet.nets.pytorch_backend.fastspeech.duration_predictor.DurationPredictor

class espnet.nets.pytorch_backend.fastspeech.duration_predictor.DurationPredictor(idim, n_layers=2, n_chans=384, kernel_size=3, dropout_rate=0.1, offset=1.0)

Bases: Module

Duration predictor module.

This module implements the duration predictor described in FastSpeech: Fast, Robust and Controllable Text to Speech. It predicts the duration of each frame in the log domain from the hidden embeddings of the encoder.

NOTE

The output domain differs between forward and inference: forward returns durations in the log domain, whereas inference returns them in the linear domain.
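The relationship between the two domains can be sketched in plain Python. This is only an illustration, assuming that the log-domain value is log(duration + offset) and that the conversion back exponentiates, subtracts the offset, rounds, and clamps at zero. The helpers `to_log_domain` and `to_linear_domain` are hypothetical and not part of ESPnet.

```python
import math

OFFSET = 1.0  # mirrors the default `offset` argument; assumed for illustration


def to_log_domain(duration, offset=OFFSET):
    """Hypothetical helper: map a frame count to the log domain."""
    # The offset keeps the argument of log() positive when duration is 0.
    return math.log(duration + offset)


def to_linear_domain(log_duration, offset=OFFSET):
    """Hypothetical helper: map a log-domain value to a non-negative integer."""
    # Undo the log, remove the offset, round, and clamp negatives to zero.
    return max(int(round(math.exp(log_duration) - offset)), 0)


durations = [0, 3, 12]
log_durations = [to_log_domain(d) for d in durations]
recovered = [to_linear_domain(x) for x in log_durations]
print(recovered)
```

Under these assumptions, a duration of zero maps to a finite log-domain value, which is the role the `offset` parameter description suggests.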

Initialize the duration predictor module.

Parameters:
- idim (int) – Input dimension.
- n_layers (int, optional) – Number of convolutional layers.
- n_chans (int, optional) – Number of channels of convolutional layers.
- kernel_size (int, optional) – Kernel size of convolutional layers.
- dropout_rate (float, optional) – Dropout rate.
- offset (float, optional) – Offset value to avoid NaN in the log domain.

forward(xs, x_masks=None)

Calculate forward propagation.

Parameters:
- xs (Tensor) – Batch of input sequences (B, Tmax, idim).
- x_masks (ByteTensor, optional) – Batch of masks indicating padded part (B, Tmax).
Returns: Batch of predicted durations in log domain (B, Tmax).
Return type: Tensor
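The shape and meaning of `x_masks` can be illustrated without tensors. The sketch below uses nested Python lists in place of (B, Tmax) tensors. It assumes that a true entry marks a padded position and that masked outputs are filled with zero. The helpers `make_pad_mask_lists` and `apply_mask` are hypothetical names introduced for this example only.

```python
def make_pad_mask_lists(lengths):
    """Hypothetical helper: build a (B, Tmax) mask, True at padded positions."""
    tmax = max(lengths)
    return [[t >= n for t in range(tmax)] for n in lengths]


def apply_mask(values, mask, fill=0.0):
    """Hypothetical helper: replace masked entries with `fill`."""
    return [
        [fill if m else v for v, m in zip(row_values, row_mask)]
        for row_values, row_mask in zip(values, mask)
    ]


lengths = [3, 1]  # true sequence lengths for a batch of two
mask = make_pad_mask_lists(lengths)
predictions = [[0.7, 1.2, 0.4], [0.9, 0.3, 0.5]]  # made-up values
masked = apply_mask(predictions, mask)
print(mask)
print(masked)
```

In this sketch the second sequence has length 1, so its last two positions are padding and their predictions are zeroed.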

inference(xs, x_masks=None)

Infer durations.

Parameters:
- xs (Tensor) – Batch of input sequences (B, Tmax, idim).
- x_masks (ByteTensor, optional) – Batch of masks indicating padded part (B, Tmax).
Returns: Batch of predicted durations in linear domain (B, Tmax).
Return type: LongTensor