espnet2.speechlm.dataloader.iterator.DataIteratorFactory
class espnet2.speechlm.dataloader.iterator.DataIteratorFactory(unregistered_specifier: str, registered_specifier: str, stats_dir: Path | str | None = None, loader_state: Path | None = None, collate_fn: Callable | None = None, batchfy_method: str = 'bucket', batch_size: int = 1000, num_workers: int = 4, rank: int = 0, world_size: int = 1, shuffle: bool = False, sequential_load: bool = False, seed: int = 42)
Bases: object
Factory for creating data iterators for SpeechLM training.
This class manages batching and data sharding across GPUs, and provides DataLoader instances for training with support for endless epochs.
Features:
- Supports multiple tasks and datasets with resampling factors
- Bucket or pack batching strategies
- Distributed training with automatic batch synchronization (see the sketch after this list)
- Deterministic shuffling with configurable seeds
- State saving/loading for training resumption
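In a distributed run, each rank constructs its own factory with its rank and the world size; batch synchronization across ranks is then handled internally. A minimal sketch, assuming torch.distributed has already been initialized:
>>> import torch.distributed as dist
>>> from espnet2.speechlm.dataloader.iterator import DataIteratorFactory
>>> factory = DataIteratorFactory(
...     unregistered_specifier="asr:libri:train.json:2.0",
...     registered_specifier="tts:lj:1.0",
...     stats_dir="/path/to/stats",
...     batch_size=10000,
...     shuffle=True,
...     rank=dist.get_rank(),
...     world_size=dist.get_world_size(),
... )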
- Parameters:
- unregistered_specifier – Space-separated unregistered data specs. Format: “task:name:data_json[:factor]” Example: “asr:librispeech:train.json:2.0”
- registered_specifier – Space-separated registered data specs. Format: “task:name[:factor]” Example: “tts:ljspeech:1.5”
- stats_dir – Directory containing statistics files (str or Path). Each file should be named “stats_{task}_{data_name}.jsonl” (see the layout sketch after this list).
- loader_state – Optional path to a saved iterator state to restore from.
- collate_fn – Optional collate function for the DataLoader.
- batchfy_method – Batching method (“bucket” or “pack”).
- batch_size – Maximum tokens per batch.
- num_workers – Number of DataLoader workers.
- rank – GPU rank for distributed training (0-indexed).
- world_size – Total number of GPUs in distributed training.
- shuffle – Whether to shuffle batches.
- seed – Random seed for reproducibility.
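For the specifiers used in the example below, the stats directory would then contain one file per dataset, named after the documented “stats_{task}_{data_name}.jsonl” pattern (a sketch; only the file names are implied by the documentation):
/path/to/stats/
    stats_asr_libri.jsonl
    stats_tts_lj.jsonl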
Example
>>> factory = DataIteratorFactory(
... unregistered_specifier="asr:libri:train.json:2.0",
... registered_specifier="tts:lj:1.0",
... stats_dir="/path/to/stats",
... batch_size=10000,
... shuffle=True,
... )
>>> loader = factory.build_iter(global_step=0, length=100)
>>> for batch in loader:
... # Training loop
...     pass
build_iter(global_step: int = 0, length: int | None = None) → DataLoader
Get a DataLoader for a specific range of batches.
Supports endless epochs by wrapping around when batches are exhausted. If the requested length exceeds remaining batches, it will continue from the beginning.
- Parameters:
- global_step – Starting batch index (must be non-negative).
- length – Number of batches to include (must be positive).
- Returns: DataLoader that iterates over the specified batch range.
- Raises: ValueError – If validation fails or no batches are available.
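Resumption maps directly onto these arguments: to continue from a checkpoint, pass the saved step as global_step and run a fixed number of further batches (a minimal sketch; the step counts are illustrative):
>>> loader = factory.build_iter(global_step=5000, length=1000)
>>> for batch in loader:
...     pass  # wraps around to the first batch once batches are exhausted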
load_iterator_state(loader_state: str)
Load iterator state from a file.
- Parameters: loader_state – Path to the iterator state file
- Raises:
- FileNotFoundError – If the state file doesn’t exist
- KeyError – If required keys are missing in the state file
save_iterator_state(loader_state: str)
Save the current state of the iterator to a file.
- Parameters: loader_state – Path to save the iterator state file
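Together with load_iterator_state(), this enables a simple checkpoint round trip (a minimal sketch; the file path is illustrative):
>>> factory.save_iterator_state("exp/iterator_state")
>>> # ... later, e.g. after a restart:
>>> factory.load_iterator_state("exp/iterator_state")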
