espnet2.speechlm.dataloader.dataset.CombinedDataset
class espnet2.speechlm.dataloader.dataset.CombinedDataset(datasets: List[Tuple[str, str]] = [], registered_datasets: List[str] = [], num_worker: int = 1, rank: int = 0, world_size: int = 1)
Bases: Dataset
Combined ESPnet Speech Language Model Dataset.
Combines multiple datasets from both direct paths and registered datasets.
- Parameters:
  - datasets – List of (name, json_path) tuples for direct dataset paths (default: [])
  - registered_datasets – List of registered dataset names to look up in the registry (default: [])
  - num_worker – Number of parallel workers for loading datasets (default: 1)
  - rank – Process rank for distributed training (default: 0)
  - world_size – Total number of processes (default: 1)
property dataset_names : List[str]
Return list of all dataset names.
get_all_examples() → Dict[str, List[str]]
Return all examples as a dictionary mapping dataset names to sample IDs.
- Returns: Dictionary mapping dataset names to lists of sample IDs
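The shape of the values returned by `dataset_names` and `get_all_examples()` can be illustrated with a small stand-in. This is only a sketch: `ToyCombinedDataset` is a hypothetical class written for this example, it is not the ESPnet implementation, and it keeps sample IDs in memory instead of reading them from JSON paths or a registry. The dataset names and sample IDs are made up.

```python
from typing import Dict, List, Tuple


class ToyCombinedDataset:
    """Hypothetical stand-in mirroring the documented interface (not ESPnet code)."""

    def __init__(self, datasets: List[Tuple[str, List[str]]]):
        # Assumption: each entry pairs a dataset name with its sample IDs.
        # The real class takes (name, json_path) tuples instead.
        self._examples: Dict[str, List[str]] = {
            name: list(ids) for name, ids in datasets
        }

    @property
    def dataset_names(self) -> List[str]:
        # Names are returned in the order the datasets were given.
        return list(self._examples)

    def get_all_examples(self) -> Dict[str, List[str]]:
        # Return a copy so callers cannot mutate the internal lists.
        return {name: list(ids) for name, ids in self._examples.items()}


# Made-up dataset names and sample IDs, for illustration only.
toy = ToyCombinedDataset(
    [("asr_train", ["utt1", "utt2"]), ("tts_train", ["utt9"])]
)
names = toy.dataset_names
examples = toy.get_all_examples()
total = sum(len(ids) for ids in examples.values())
```

A mapping of this form makes it straightforward to count or iterate samples per dataset, for example when building a sampler over the combined data.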
verify_subset_entries(task, data_name, required_entries)
Verify that a dataset contains all required entries for a task.
