PBS Batch Serial Job Submission with Slot Limit

The situation: You have an “embarrassingly parallel” set of tasks that do not need to communicate with each other. They are currently configured to run in serial, but their combined runtime would exceed the allowed queue time (or your patience).

Goal: Submit a batch of serial jobs to a PBS queue in a sensible way.

Your old job script, my_job.sh:

#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

for number in {1..1000}
do
  ./my_program -n ${number}
done

As usual, you would submit this with qsub my_job.sh.

New idea:
Most PBS systems support job arrays with the -t flag:

qsub -t 1-1000 my_job.sh

The index of each job is passed into my_job.sh via the $PBS_ARRAYID environment variable, so that the job script can be revised:

#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1

cd $PBS_O_WORKDIR

./my_program -n $PBS_ARRAYID
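
If each task needs a different argument rather than just its index, a common pattern is to map the array index to a line of a parameter file. A minimal sketch — params.txt and the hard-coded PBS_ARRAYID are illustrative assumptions; in a real array job, PBS sets the variable for you:

```shell
# Build a hypothetical parameter file, one argument per line
printf 'alpha\nbeta\ngamma\n' > params.txt

# PBS sets PBS_ARRAYID in a real job; hard-coded here for illustration
PBS_ARRAYID=2

# sed -n "Np" prints only line N, so index 2 selects the second line
arg=$(sed -n "${PBS_ARRAYID}p" params.txt)
echo "$arg"
```

The job script would then invoke ./my_program with $arg instead of the raw index.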

Now you have 1000 independent jobs in the queue. They are “parallel” in the sense that they can run simultaneously, but they do not communicate with each other (no MPI). In qstat, the whole array appears as a single entry, which saves you scrolling through 1000 lines:

jdh366@thecube ~/demos-meeting/serial-batch
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
117046[].thecube           my-job-script.sh jdh366                 0 R default  

The stdout from each job will be saved in files named my-job-script.sh.o117046-<i> where <i> is the index of the job.
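
Once the array finishes, you can stitch the per-index stdout files back together in index order. A sketch, assuming the example job ID 117046 from above — the stand-in files created here just simulate what PBS would have written:

```shell
jobid=117046

# Create three stand-in per-task output files, as PBS would have
for i in 1 2 3; do
  echo "result $i" > "my-job-script.sh.o${jobid}-${i}"
done

# Concatenate them in index order into a single file
for i in 1 2 3; do
  cat "my-job-script.sh.o${jobid}-${i}"
done > all_output.txt

cat all_output.txt
```

For a real run you would loop over the full index range (e.g. seq 1 1000) instead of the three shown here.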

What if I have more jobs than processors?
You have two choices. First, you could submit all of them at once: PBS will start as many jobs as it can and work through the rest as slots free up. However, this is likely to occupy the entire system with many short serial jobs, which other users won’t appreciate.

The second option is to specify a “slot limit” using the % character in the qsub command like so:

qsub -t 1-1000%50 my_job.sh

This command runs at most 50 of the array’s jobs at a time; as jobs finish, PBS starts more, keeping 50 running until all 1000 are complete. This avoids occupying the entire system at once.
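
A quick back-of-the-envelope for the slot limit: with 1000 one-hour tasks (the walltime requested above) and 50 slots, the array drains in about 20 “waves”. The arithmetic, sketched in shell:

```shell
tasks=1000
slots=50
hours_per_task=1

# Ceiling division: number of waves of concurrent jobs
waves=$(( (tasks + slots - 1) / slots ))
total_hours=$(( waves * hours_per_task ))

echo "${waves} waves, roughly ${total_hours} hours total"
```

So the whole batch finishes in roughly a day of wall-clock time, rather than the 1000 serial hours of the original loop.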

For more HPC submission script examples, see here:
https://github.com/jdherman/hpc-submission-scripts