espnet2.speechlm.model.speechlm.parallel_utils.parallel_dims.init_parallel_dims
espnet2.speechlm.model.speechlm.parallel_utils.parallel_dims.init_parallel_dims(titan_config: Dict[str, Any]) → Tuple[ParallelDims, int, int]
Create ParallelDims for distributed training.
Supports FSDP2 (dp_shard), HSDP (dp_replicate), pipeline parallelism (pp), and expert parallelism (ep).
The constraint dp_replicate * dp_shard * pp == world_size is enforced by TorchTitan; dp_shard=-1 auto-computes the remainder.
EP borrows from the FSDP dimension; it does NOT consume additional world_size. TorchTitan internally computes efsdp = dp_shard / ep for the expert FSDP mesh. For example, with 8 GPUs and ep=8: dense parameters use fsdp=8, while expert parameters use efsdp=1 and ep=8.
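For concreteness, a minimal arithmetic sketch of the world_size constraint and the EP example above; the values are illustrative only and not taken from any default configuration:

    # Constraint enforced by TorchTitan: dp_replicate * dp_shard * pp == world_size
    world_size = 8
    dp_replicate, dp_shard, pp = 1, 8, 1
    assert dp_replicate * dp_shard * pp == world_size

    # EP borrows from the FSDP dimension, so no extra world_size is needed.
    ep = 8
    efsdp = dp_shard // ep   # = 1: expert-FSDP mesh size for expert params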
This function assumes:
- torch.distributed is already initialized (via dist.init_process_group)
- CUDA device is already set (via torch.cuda.set_device)
Parameters: titan_config – TorchTitan configuration dictionary containing:
- dp_replicate: HSDP replicate degree (default: 1)
- dp_shard: FSDP sharding degree (-1 = auto, default: -1)
- pp_degree: Pipeline parallel degree (default: 1)
- ep: Expert parallel degree (default: 1). Must divide dp_shard evenly.
Returns:
- parallel_dims: ParallelDims object with device meshes built
- local_rank: Local rank within the node (current CUDA device)
- global_rank: Global rank across all nodes
Return type: Tuple of (parallel_dims, local_rank, global_rank)
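Example: a minimal usage sketch, assuming the script is launched with torchrun (so LOCAL_RANK is set in the environment) and that the import path below matches the installed package; the config values are illustrative only.

    import os
    import torch
    import torch.distributed as dist

    from espnet2.speechlm.model.speechlm.parallel_utils.parallel_dims import (
        init_parallel_dims,
    )

    # Preconditions stated above: process group initialized, CUDA device set.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Illustrative config: 8 GPUs, FSDP2 with expert parallelism borrowing
    # the FSDP dimension (no HSDP replication, no pipeline parallelism).
    titan_config = {
        "dp_replicate": 1,   # HSDP replicate degree
        "dp_shard": -1,      # -1 = auto-compute the remaining degree
        "pp_degree": 1,      # pipeline parallel degree
        "ep": 8,             # expert parallel degree; must divide dp_shard
    }

    parallel_dims, local_rank, global_rank = init_parallel_dims(titan_config)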
