espnet2.samplers.category_power_sampler.CategoryDatasetPowerSampler
class espnet2.samplers.category_power_sampler.CategoryDatasetPowerSampler(batch_bins: int, shape_files: Tuple[str, ...] | List[str], min_batch_size: int = 1, max_batch_size: int | None = None, category_upsampling_factor: float = 1.0, dataset_upsampling_factor: float = 1.0, dataset_scaling_factor: float = 1.2, drop_last: bool = False, category2utt_file: str | None = None, dataset2utt_file: str | None = None, utt2dataset_file: str | None = None, epoch: int = 1, **kwargs)
Bases: AbsSampler
A category- and dataset-balanced batch sampler with power-law sampling.
Reference: Scaling Speech Technology to 1,000+ Languages (https://arxiv.org/pdf/2305.13516)
This sampler is designed for multi-category, multi-dataset training where both category imbalance and dataset imbalance exist. It performs hierarchical sampling: (1) balancing categories (e.g., languages) within each dataset, and (2) balancing datasets themselves.
Sampling Strategy:
Let:
- d ∈ {1, 2, …, D} denote the dataset index
- l ∈ {1, 2, …, L_d} denote the category index in dataset d
- n_ld: total duration (number of bins) of category l in dataset d
- k_ld: the number of utterances in category l in dataset d
- N_d = ∑_l n_ld: total duration (number of bins) of all categories in dataset d
- M = ∑_d N_d: total duration (number of bins) of all categories across all datasets
Step 1 — Category-level sampling within each dataset:
P(l | d) ∝ (n_ld / N_d)^β_L
where β_L (category_upsampling_factor) controls how strongly to upsample low-resource languages within each dataset. The normalized probability is:
P(l | d) = [(n_ld / N_d)^β_L] / ∑_l′ [(n_l′d / N_d)^β_L]
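To make the effect concrete, here is a minimal NumPy sketch of this category-level renormalization (toy durations and a β_L value chosen purely for illustration; variable names are ours, not ESPnet's):

    import numpy as np

    # Durations (number of bins) of each category within one dataset: n_ld
    n_ld = np.array([1000.0, 100.0, 10.0])
    N_d = n_ld.sum()

    beta_L = 0.5  # category_upsampling_factor; beta_L < 1 flattens the distribution

    # P(l | d) ∝ (n_ld / N_d)^beta_L, normalized over categories l
    p_l_given_d = (n_ld / N_d) ** beta_L
    p_l_given_d /= p_l_given_d.sum()

    print(p_l_given_d)  # ~[0.71, 0.22, 0.07] vs. raw proportions [0.90, 0.09, 0.01]

With β_L = 1 the raw duration proportions are recovered; as β_L approaches 0, the categories approach a uniform distribution.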
Step 2 — Dataset-level sampling based on resampled language distributions:
For each dataset d, the resampled number of bins for category l is:
n_ld′ = N_d × P(l | d)
Since the category probabilities sum to 1 within each dataset (∑_l P(l | d) = 1), the total number of resampled bins N_d′ for dataset d is unchanged:
N_d′ = ∑_l n_ld′ = N_d
The probability of sampling dataset d is then:
P(d) = [(N_d / M)^β_D] / ∑_j [(N_j / M)^β_D]
where β_D (dataset_upsampling_factor) controls how strongly to upsample small datasets.
Final utterance sampling probability:
P(x) = P(d) × P(l | d) × P(x | l, d), where P(x | l, d) = 1 / k_ld, i.e., utterances are drawn uniformly within a category.
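Putting both levels together, a hedged end-to-end sketch of the final utterance probability (the toy corpus, dataset names, and β values are assumptions for illustration only):

    import numpy as np

    # Toy corpus: per-category durations n_ld (bins) for two datasets
    n = {
        "fleurs": np.array([500.0, 50.0]),
        "babel": np.array([2000.0, 200.0, 20.0]),
    }
    beta_L, beta_D = 0.5, 0.5  # category/dataset_upsampling_factor

    N = {d: v.sum() for d, v in n.items()}  # N_d
    M = sum(N.values())                     # M = sum_d N_d

    # Step 1: P(l | d) within each dataset
    p_l = {d: (v / N[d]) ** beta_L for d, v in n.items()}
    p_l = {d: w / w.sum() for d, w in p_l.items()}

    # Step 2: P(d) across datasets (N_d' = N_d because P(l | d) sums to 1)
    w_d = np.array([(N[d] / M) ** beta_D for d in n])
    p_d = dict(zip(n, w_d / w_d.sum()))

    # Final: an utterance x from category l of dataset d with k_ld utterances
    # is drawn with P(x) = P(d) * P(l | d) / k_ld
    k_ld = 120  # hypothetical utterance count for category 0 of "fleurs"
    p_x = p_d["fleurs"] * p_l["fleurs"][0] / k_ld
    print(p_x)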
Note:
- Batches are constructed based on batch_bins, similar to LengthBatchSampler.
- Set batch_type=catpow_balance_dataset to enable this sampler.
- This sampler is particularly useful when combining heterogeneous datasets (e.g., FLEURS + VoxLingua107 + BABEL) with highly imbalanced language and size distributions.
Parameters:
- batch_bins – The approximate maximum number of bins (e.g., audio samples) in a batch.
- shape_files – A list or tuple of shape file paths. Only one shape file is supported, but the list format is retained for compatibility with other samplers.
- min_batch_size – Minimum number of utterances in a batch.
- max_batch_size – Maximum number of utterances in a batch (recommended for memory safety).
- category_upsampling_factor – β_L in the formula; controls per-dataset category balancing.
- dataset_upsampling_factor – β_D in the formula; controls balancing between datasets.
- dataset_scaling_factor – A multiplier that determines the total number of utterances sampled. Values > 1 simulate more frequent use of low-resource utterances across batches. Must be ≥ 1.
- drop_last – Whether to drop the final batch.
- category2utt_file – Path to a file mapping each category to its utterance IDs.
- dataset2utt_file – Path to a file mapping each dataset to its utterance IDs.
- utt2dataset_file – Path to a file mapping each utterance ID to its corresponding dataset label.
- epoch – Epoch number used to seed the random generator, making shuffling reproducible while still varying across epochs.
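For concreteness, a hypothetical instantiation using the signature above (all paths are placeholders, and the β values are illustrative, not recommended defaults):

    from espnet2.samplers.category_power_sampler import CategoryDatasetPowerSampler

    sampler = CategoryDatasetPowerSampler(
        batch_bins=4_000_000,
        shape_files=["exp/stats/train/speech_shape"],  # only one shape file is supported
        min_batch_size=1,
        max_batch_size=128,              # recommended for memory safety
        category_upsampling_factor=0.5,  # beta_L < 1 upsamples low-resource categories
        dataset_upsampling_factor=0.5,   # beta_D < 1 upsamples small datasets
        dataset_scaling_factor=1.2,
        drop_last=False,
        category2utt_file="exp/stats/train/category2utt",
        dataset2utt_file="exp/stats/train/dataset2utt",
        utt2dataset_file="exp/stats/train/utt2dataset",
        epoch=1,
    )
    print(f"{len(sampler)} batches per epoch")

In normal training this sampler is enabled through the configuration (batch_type=catpow_balance_dataset) rather than constructed by hand.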