Often, we want to run a workflow that contains many different steps. Let’s use the de novo transcriptome pipeline we provide as an example.
In this pipeline, we run a few initial setup steps (normalization, etc.), then several assemblers in parallel (each with its own series of steps), then a final set of steps that must run in sequence. Managing these jobs by hand can get tiresome if you are trying to do this at a scale of, say, 40 samples.
NOTE: DO NOT DO THIS UNTIL YOU HAVE A SOLID WORKFLOW IN PLACE. Our pipeline is not converted to run the way we discuss here, for a couple of reasons:
- I want to force people to look at what they are doing
- I have no way to predict exactly what resources any given user will need, given the heterogeneity of input libraries.
If you are running the SAME pipeline on very similar data for many different samples of the same project, go for it ^_^. But test thoroughly ahead of time.
So, let’s look at running just one of the assemblers with multiple steps – Velvet. This assembler runs in three steps – velveth, velvetg, and oases – each with its own job file: RunVelvet1, RunVelvet2, and RunVelvet3. RunVelvet1 takes some memory, RunVelvet2 takes a lot, and RunVelvet3 takes very little. And just to make things a bit more complicated, we also have RunVelvet1b, RunVelvet2b, and RunVelvet3b, because we have split the workload across two job files to keep resource requests lower and therefore move through the queue faster.
Summary:
- RunVelvet1: needs moderate memory
- RunVelvet1b: needs moderate memory
- RunVelvet2: needs high memory and RunVelvet1's results
- RunVelvet2b: needs high memory and RunVelvet1b's results
- RunVelvet3: needs low memory and RunVelvet2's results
- RunVelvet3b: needs low memory and RunVelvet2b's results
Because I’ve done this a hundred times, I don’t want to have to manually track each job and submit the next step when the current one is done. Wouldn’t it be nice if that could happen automatically? On Carbonate it can!
You could just add “qsub RunVelvet2” to the end of RunVelvet1, and “qsub RunVelvet3” to the end of RunVelvet2. However, those jobs would be submitted whether or not the previous one completed successfully. Submitting jobs that are going to fail because the previous step wasn’t complete can drop your queue priority and leave you waiting unnecessarily long for jobs to run.
Also, many clusters do not let you submit jobs from compute nodes, meaning you cannot submit a job from within a job.
So instead you can create a driver file. This is basically a wrapper script that submits the jobs in a workflow and tells the scheduler how they depend on each other. When a job completes successfully, it returns an exit code of 0 (usually, though there are exceptions to this). You see this in your emails from the job scheduler. If it exits with a different code, you don’t want the next job to be submitted.
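As a side note, the exit code of a job is normally just the exit code of the last command in the job file, but you can also exit non-zero explicitly when a step fails. Here is a minimal sketch of what that might look like inside a job file; the velveth arguments are placeholders I made up, not the ones from our pipeline:

velveth velvet_out 31 -fastq reads.fq
if [ $? -ne 0 ]; then
    echo "velveth failed; exiting with an error code"
    exit 1    # a non-zero exit tells the scheduler this job failed, so the next step should not run
fi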
So let’s start with the familiar – a qsub command:
qsub RunVelvet1.sh
qsub prints the id of the job it just submitted. We can capture that id in a variable and echo it to check:

id=$(qsub RunVelvet1.sh)
echo $id

Now we can use that saved id to tell a second job to wait on the first. Here is a minimal two-step driver file:

#!/bin/bash
id=$(qsub RunVelvet1.sh)
qsub -W depend=afterok:$id RunVelvet2.sh
The first line saves the automatically returned job id of the submitted job to a variable.
Also, wait… qsub has options?! Yup! Check out man qsub if you are curious! In this case, -W is the additional attributes flag, which allows us to specify when this job should run. depend=afterok (“after OK”) will only launch the job once the job with the given id (in this case RunVelvet1’s job id) has exited without error (exit 0).
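A quick way to convince yourself this is working: right after the driver submits both jobs, the dependent job sits in the queue in a held state until RunVelvet1 exits cleanly. You can watch this with qstat:

qstat -u $USER
# the RunVelvet2 job should show a state of H (held) until RunVelvet1 finishes without error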
Here is the driver file with three steps:
#!/bin/bash
id=$(qsub RunVelvet1.sh)
id2=$(qsub -W depend=afterok:$id RunVelvet2.sh)
qsub -W depend=afterok:$id2 RunVelvet3.sh
Here we capture the job id of the second job for use in launching the third. Notice that you don’t need to save the job id of the third step, because it is the final step and nothing is waiting on it to finish.
To scale this up to run both 1 and 1b, 2 and 2b, 3 and 3b, you can do the following:
#!/bin/bash
id=$(qsub RunVelvet1.sh)
id2=$(qsub -W depend=afterok:$id RunVelvet2.sh)
qsub -W depend=afterok:$id2 RunVelvet3.sh
idb=$(qsub RunVelvet1b.sh)
id2b=$(qsub -W depend=afterok:$idb RunVelvet2b.sh)
qsub -W depend=afterok:$id2b RunVelvet3b.sh
Because all the jobs are submitted immediately (they just sit in the queue until their dependency finishes), you can simply add the second set. RunVelvet1 and RunVelvet1b will run at the same time, and as each one finishes, its subsequent job automatically becomes eligible to run.
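If you really are running this for many samples of the same project (the 40-sample scenario from earlier), the driver is also a natural place for a loop. This is only a sketch: it assumes your job files read a SAMPLE environment variable (passed with qsub -v) to find their input files, and the sample names here are made up.

#!/bin/bash
# Hypothetical: each RunVelvet*.sh is assumed to use $SAMPLE to locate its inputs
for SAMPLE in sample01 sample02 sample03; do
    id=$(qsub -v SAMPLE=$SAMPLE RunVelvet1.sh)
    id2=$(qsub -v SAMPLE=$SAMPLE -W depend=afterok:$id RunVelvet2.sh)
    qsub -v SAMPLE=$SAMPLE -W depend=afterok:$id2 RunVelvet3.sh
done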
Make your driver file executable and run it from the command line:
chmod 755 Driver.sh
./Driver.sh
A set of example files with the same names and setup is available here. This is written for Carbonate@IU, but the general example will work for any PBS-like job handler. Feel free to contact help@ncgas.org if you want help with a different job handler (e.g. Slurm).
Benefits
- If you make job files for each step (HIGHLY recommended), you end up with resource requests that are smaller (so they go through the queue faster) and specific to that job. If you combined all the scripts for Velvet steps 1, 2, and 3 into one file (which you can), you’d have to request a much longer walltime, much higher memory (which only part of the workflow actually needs), and more processors for the whole run. It takes much more time to get a large block of resources than to get several small ones. See the sketch after this list for what per-step requests might look like.
- You lose zero time between steps: the next job is already queued and starts as soon as its dependency finishes.
- If a job fails, its dependent jobs never run, so you don’t lose queue priority or time on jobs that were doomed to fail.
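To make the first point concrete, here is a rough sketch of what the #PBS resource requests in two of the job files might look like. The numbers are invented placeholders; request what your own data actually needs.

# RunVelvet1.sh header: moderate memory
#PBS -N RunVelvet1
#PBS -l nodes=1:ppn=8,vmem=64gb,walltime=12:00:00
#PBS -m abe

# RunVelvet2.sh header: the only step that asks for the big memory block
#PBS -N RunVelvet2
#PBS -l nodes=1:ppn=8,vmem=250gb,walltime=12:00:00
#PBS -m abe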
Pro tips
Output
One additional thing that makes this easier to monitor is to have output in your job files that tells you what is going on. For instance, in RunVelvet1, I’d put the following line before any other commands run:
echo "Velvet1 started" >> log
and the following as the last line in the file:
echo "Velvet1 job file exiting" >> log
Because every job appends to the same file, you end up with a file named log that looks something like this:
Velvet1 started
Velvet1b started
Velvet1 job file exiting
Velvet2 started
Velvet1b job file exiting
Velvet2b started
Velvet2b job file exiting
Velvet3b started
Velvet2 job file exiting
Velvet3 started
Velvet3 job file exiting
Velvet3b job file exiting
Make sure you use “>>” and not “>” or you will only capture the last output. “>>” appends to a file, “>” overwrites.
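One optional refinement: you can fold a timestamp and the job id into those messages, since TORQUE/PBS sets the PBS_JOBID variable inside a running job:

echo "Velvet1 started at $(date) as job $PBS_JOBID" >> log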
Quality checks
It can be really useful to have metrics on the output of each step, and these checks can be rolled into the job file. Having RunVelvet2b report how many contigs it found is helpful; having a Trinity job report how many sequences it assembled can be too. This guards against jobs that exit with code 0 even though something actually went wrong. If you don’t see the output you expect, you can halt the whole workflow and investigate.
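For example, here is a rough sketch of the kind of check you could put at the end of a Velvet job file. The velvet_out/contigs.fa path is an assumption about where your velvetg output lands; adjust it to your own directory layout.

NCONTIGS=$(grep -c "^>" velvet_out/contigs.fa)
echo "Velvet2b produced $NCONTIGS contigs" >> log
if [ "${NCONTIGS:-0}" -eq 0 ]; then
    echo "Velvet2b produced no contigs, halting here" >> log
    exit 1    # the non-zero exit keeps the dependent job from ever starting
fi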
General tips for developing a workflow
- For each step:
- make a small input – only a few lines are needed
- run the command on the login node – this should run very quickly with a small file
- move that command into a job file, make the job file executable, and run it (e.g. ./RunVelvet1)
- submit the job with the small input, requesting very few resources, as it doesn’t need much
- scale up to a full-size sample and submit with slightly higher than expected resources – then adjust based on what the job actually uses (it’s emailed to you!)
- add output to the beginning and end of the job file to announce when it starts and ends
- Make a map of which jobs need to depend on which outputs.
- Create and test a driver file. Include output that says the workflow is beginning and ending (see the sketch after this list).
- Rejoice!
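For the driver file output mentioned above, something like this minimal sketch works (same file names as before). The driver itself can only announce that everything has been submitted; the closing echo in the final job file is what actually marks the end of the workflow.

#!/bin/bash
echo "Velvet workflow: submitting jobs" >> log
id=$(qsub RunVelvet1.sh)
id2=$(qsub -W depend=afterok:$id RunVelvet2.sh)
qsub -W depend=afterok:$id2 RunVelvet3.sh
echo "Velvet workflow: all jobs submitted and dependencies set" >> log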