Using job scheduling system
Our recipes support several job scheduling systems, namely SGE, PBS/Torque, and Slurm, following Parallelization in Kaldi. By default, jobs run on the local machine. If a job scheduling system is available in your environment, you can submit a larger number of jobs across multiple machines.
Please ask your administrator to install one if you have multiple machines.
Select a job scheduler
`cmd.sh` is a configuration file used by `run.sh` to set some shell variables. These shell variables should be set to one of the following Perl scripts:
cmd | Backend | Configuration file
---|---|---
run.pl | Local machine (default) | -
queue.pl | Sun Grid Engine, or a grid-engine-like tool | conf/queue.conf
slurm.pl | Slurm | conf/slurm.conf
pbs.pl | PBS/Torque | conf/pbs.conf
ssh.pl | SSH | .queue/machines
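As a minimal sketch, `cmd.sh` might export variables like the ones below. The variable names (`train_cmd`, `cuda_cmd`, `decode_cmd`) follow the common Kaldi-style convention and the chosen backends are only examples; check your recipe's `cmd.sh` for the actual names it uses.

```
# Hypothetical cmd.sh: pick a backend script and export it so that
# run.sh can use ${train_cmd}, ${cuda_cmd}, and ${decode_cmd}.
export train_cmd="queue.pl"            # CPU jobs
export cuda_cmd="queue.pl --gpu 1"     # GPU jobs
export decode_cmd="queue.pl --mem 4G"  # decoding jobs

# To run everything on the local machine instead:
# export train_cmd="run.pl"
# export cuda_cmd="run.pl"
# export decode_cmd="run.pl"
```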
Usage of run.pl
`run.pl`, `queue.pl`, `slurm.pl`, `pbs.pl`, and `ssh.pl` have a unified interface, so any one of them can be assigned to `${cmd}` in the shell script:
```
nj=4
${cmd} JOB=1:${nj} JOB.log echo JOB
```
`JOB=1:${nj}` specifies the parallelization, known as an "array job", with `${nj}` jobs. `JOB.log` is the destination of the stdout and stderr of the jobs. The string `JOB` is replaced by the job number wherever it appears in the log file name or the command-line arguments, i.e. the following commands are almost equivalent to the above:
```
echo 1 &> 1.log &
echo 2 &> 2.log &
echo 3 &> 3.log &
echo 4 &> 4.log &
wait
```
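As a concrete illustration, a recipe step might process several data splits in parallel as in the sketch below. The script `local/make_feats.sh`, the split layout `data/train/split${nj}/JOB`, and the log directory are hypothetical and only serve to show how `JOB` is substituted.

```
# Hypothetical recipe step: JOB is replaced by 1..4 both in the log
# path and in the command-line arguments, so each split writes to its
# own log file and output directory.
nj=4
${cmd} JOB=1:${nj} exp/make_feats/log/feats.JOB.log \
  local/make_feats.sh data/train/split${nj}/JOB exp/make_feats/JOB
```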
Configuration
To change the command-line options used to submit jobs (e.g. the queue name or resource requests), you also need to modify the configuration file of the job scheduler you selected. The following text is an example of `conf/queue.conf`.
```
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
```
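For comparison, a `conf/slurm.conf` for `slurm.pl` follows the same `command`/`option`/`default` syntax. The sketch below is an illustrative example, not a drop-in file: the partition names `cpu` and `gpu` are placeholders, and the exact `sbatch` resource flags you need depend on your cluster.

```
# Hypothetical conf/slurm.conf using the same syntax.
command sbatch --export=PATH
option mem=* --mem-per-cpu $0
option mem=0 # Do not add anything
option num_threads=* --cpus-per-task $0
option num_threads=1 # Do not add anything
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0
```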
Note that the queue/partition name `-q g.q` is only an example, so you must change it to an existing queue/partition in your cluster.
You can't pass scheduler-specific options directly in our scripts, e.g. you can't give the `-q` option to `queue.pl` on the command line. Instead, you can use the generic options `--mem`, `--num_threads`, `--max_jobs_run`, and `--gpu`.
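For instance, a GPU job could request resources only through these generic options; the log path and the training command below are illustrative.

```
# --gpu, --mem, and --num_threads are translated by conf/queue.conf into
# scheduler flags (e.g. --gpu 1 becomes "-l gpu=1 -q g.q" with the
# configuration above).
${cmd} --gpu 1 --mem 8G --num_threads 4 exp/train/log/train.log \
  python train.py
```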
Take a look at the following line:

```
option gpu=* -l gpu=$0 -q g.q
```

This means that the option specified in the second column, `gpu=*`, will be converted into the options following it, `-l gpu=$0 -q g.q`. For example,

```
queue.pl --gpu 2
```

will be converted to

```
qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* -l gpu=2 -q g.q
```
You can also add a new option for your system using this syntax:

```
option foo=* --bar $0
```
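With such a line added (keeping the placeholder names `foo` and `bar`), the new option becomes available on the command line just like the built-in ones:

```
# Hypothetical: with "option foo=* --bar $0" in conf/queue.conf,
# --foo 10 is translated into "--bar 10" on the generated qsub command.
queue.pl --foo 10 JOB=1:4 JOB.log echo JOB
```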