espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce(tid: int, x, storage, count: int, R_opid: int)
CUDA Warp reduction kernel.
It is a device kernel to be called by other kernels.
The data is read from the right segment recursively and reduced (via the R_Op operator) onto the left half. This shared-memory folding continues while the offset is at least the warp size; within a warp, reduction proceeds via `shfl_down_sync`, which halves the reduction space and combines the two halves at each step.
NOTE
Efficient warp reduction occurs when the input shape is a power of two (2 ^ K).
References
- Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
- tid – CUDA thread index
- x – activation. Single float.
- storage – shared memory of size CTA_REDUCE_SIZE used for reduction in parallel threads.
- count – equivalent to num_rows, which is equivalent to alphabet_size (V+1)
- R_opid – Operator ID for reduction. See R_Op for more information.
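The halving pattern above can be illustrated on the CPU. The following is a minimal sketch, not the actual CUDA kernel: it simulates the two phases (shared-memory folding down to warp width, then `shfl_down_sync`-style halving within a warp) sequentially on a Python list. The names `cta_reduce_sim` and `rop` are illustrative assumptions, standing in for the real device kernel and the operator selected by `R_opid`.

```python
def cta_reduce_sim(values, rop, warp_size=32):
    """Simulate the CTA tree reduction over `values` with binary operator `rop`.

    `values` must have a power-of-two length >= `warp_size` (matching the
    2 ^ K efficiency note above). Not the real kernel: all "threads" run
    sequentially here, so no synchronization is needed.
    """
    storage = list(values)  # stands in for the shared-memory buffer

    # Phase 1 (shared memory): fold the right half onto the left half
    # while the active span is at least one warp wide.
    offset = len(storage) // 2
    while offset >= warp_size:
        for tid in range(offset):
            storage[tid] = rop(storage[tid], storage[tid + offset])
        offset //= 2

    # Phase 2 (warp shuffle): emulate shfl_down_sync halving within
    # a single warp until one value remains in storage[0].
    offset = warp_size // 2
    while offset > 0:
        for tid in range(offset):
            storage[tid] = rop(storage[tid], storage[tid + offset])
        offset //= 2

    return storage[0]


# Example: max-reduction over 128 activations, analogous to an R_Op
# that selects the maximum.
print(cta_reduce_sim(list(range(128)), max))  # -> 127
```

Each pass halves the reduction space, so the whole reduction takes O(log N) steps rather than N - 1 sequential applications of the operator, which is what makes the pattern efficient on parallel hardware.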