In one of our current projects, we’re running a bunch of batch jobs simultaneously. When a single job enters the queue, everything is fine. But when 15-20 of our jobs enter the queue at the same time, the filesystem slows to a crawl, and the jobs wind up exceeding the walltime.
So what we’d like to do is ensure that only a certain number of these jobs can run simultaneously. The brute force solution is to sit there and stare at the job queue, but that’s not very appealing. Instead we can use PBS job dependencies. (This is very helpful, and I can’t believe I’m just learning about it now).
Job dependency works like this:
qsub -N job_name -W depend=afterok:other_job_id my_job_script.sh
The -W option to qsub passes additional job attributes. There are a number of different dependency types you can specify; see this discussion for a more complete list. Here I’m using afterok, which means “only run after this other job has completed without errors”. In addition to solving our main problem, this also ensures that we won’t keep submitting the rest of the set if there’s an error in one of the jobs. Note that you can append multiple job ids to the afterok option, like this:
qsub -N job_name -W depend=afterok:job1:job2:job3 my_job_script.sh
Here is an example of submitting 200 separate jobs in a loop, where I only want to run a maximum of 7 at a time. The jobs are divided into “globs” (my non-technical term) where each glob will not start running until the previous glob has completed without errors.
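A minimal dry-run sketch of that glob loop is below. The 200 jobs, glob size of 7, and my_job_script.sh come from the description above; the submit_globs function name and the fake job ids are made up for illustration. It just echoes the qsub commands it would run — on a real cluster you’d capture the id qsub prints instead (jid=$(qsub ...)).

```shell
#!/bin/sh
# Dry-run sketch: submit NJOBS jobs in "globs" of GLOB_SIZE, where each
# glob only starts after every job in the previous glob finished OK.
submit_globs() {
    GLOB_SIZE=7
    NJOBS=200
    prev_ids=""   # colon-prefixed ids of the previous glob, e.g. ":1:2:3"
    cur_ids=""    # ids accumulated for the glob being submitted now

    i=1
    while [ "$i" -le "$NJOBS" ]; do
        if [ -n "$prev_ids" ]; then
            # Depend on every job in the previous glob.
            echo "qsub -N job_$i -W depend=afterok$prev_ids my_job_script.sh"
        else
            # First glob has nothing to wait on.
            echo "qsub -N job_$i my_job_script.sh"
        fi
        jid=$i                    # stand-in for the id qsub would print
        cur_ids="$cur_ids:$jid"
        if [ $((i % GLOB_SIZE)) -eq 0 ]; then
            prev_ids="$cur_ids"   # the next glob waits on this whole glob
            cur_ids=""
        fi
        i=$((i + 1))
    done
}

submit_globs
```

So jobs 1–7 submit with no dependency, job 8 gets -W depend=afterok:1:2:3:4:5:6:7, and so on up each glob boundary. Note this trades throughput for safety: a glob won’t start until the slowest job in the previous glob finishes.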
Caveat emptor: I am still in the process of testing this. But regardless of this particular implementation, it’s a cool idea that should be useful in the future.