Job submission is a necessary part of using a cluster. To learn more about job submission and how to write a job script, go here. In this blog post we will walk through the next question after submitting a job: how do we know our job completed, and if so, was it successful? A job can error out for many reasons: the file path to the input file was incorrect, the requested compute resources were not enough, or the resource request was too large to fit in any available queue. There are several possibilities, but a couple of quick steps can help narrow down the problem and tell you whether it's an easy fix or time to reach out for help.
One more note before we start: while the commands and emails shown below are for the PBS TORQUE job scheduler, the debugging steps are broadly the same for other job schedulers, such as SLURM on Bridges.
Ideally, your job submission script includes the flags below so the scheduler emails you when the job starts, aborts, or ends. That email is your first clue.
#PBS -m abe #mail me when job: a – abort, b – begins, e – ends
#PBS -M <your email>
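For context, here is what a minimal job script with those flags might look like. This is a sketch: the job name, resource requests, email address, and program call are all placeholders to adapt for your own job.

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
#PBS -m abe
#PBS -M your.name@example.edu

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program input.fastq > output.txt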
The first clue – check your email
An email similar to the one below will be sent to the address you provided.
1. First, take a look at the exit code; exit_status=0 is an indication that nothing went wrong.
Notice I don't actually say "success", because there are a few cases where the exit code is 0 but the job actually failed. So exit_status is just the first checkpoint.
Some common exit codes from the TORQUE resource manager (more information is available here).
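If you deleted the email (or never set up the -m/-M flags), you can often recover the exit status from the scheduler itself. A minimal sketch, assuming your site keeps completed jobs visible for a while and using the job ID from this post:

# full status for a recently completed job, including exit_status
qstat -f 340249 | grep exit_status

# or dig it out of the server logs with TORQUE's tracejob utility
tracejob 340249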
2. Next, take a look at the "resources used" lines. If the job ran successfully, it should have consumed some amount of resources. Now take a look at the email here; what do you think?
The job actually used 0kb of memory; that is another indication the job likely failed.
Also, notice the walltime: the job ran for only a few seconds, which is yet another sign it didn't complete.
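The same qstat -f output that reports the exit status also lists these resource figures, so you can double-check them from the command line too (again assuming your site retains completed jobs):

qstat -f 340249 | grep resources_used
# typical fields: resources_used.mem, resources_used.vmem,
# resources_used.cput, resources_used.walltime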
That sums it up: job "340249", the one we are looking into in this post, likely failed. This is a start, but where can we find more information, for example a better description of why it failed?
Taking a look at job logs
The answer to the question "why did the job fail" is written to the job log files (in most cases). If you are not sure what the job log files are, let me explain briefly.
You may have noticed that every time you submit a job, two files are generated, named <job name>.e<job number> and <job name>.o<job number>. The path to these log files is also written in the email.
Log back into the machine and go to the path mentioned in the email; usually, it is the directory you were in when you launched the job script. You should see the two files listed there, as shown below.
Note: this is not the exact directory, tool, or job ID from the email above; the switch was made so we can work through some real errors and learn how to debug them.
quast.e244419 – job error log
quast.o244419 – job output log
quast.log – the log from the quast program itself; not all programs write a log, so this file may not appear for other programs
quast.sh – job script that was submitted
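From that directory you can open the logs right away; using the file names above:

# page through the error log first, then the output log
less quast.e244419
less quast.o244419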
Take a look at the error log first; here is a picture of the log file.
Before you continue reading: what do you think caused this error?
The job failed because it could not open the input file "*.fastq" that was given. The giveaway is the phrase "no such file or directory".
Now go back to the job script and check the file name, making sure the input file is entered with the correct path.
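One way to catch this class of error early is to test the path yourself on the login node, or to make the job script fail with a clear message before the real work starts. A small sketch, with a hypothetical input path:

# on the login node, confirm the file actually exists
ls -l /path/to/reads.fastq

# or guard the job script so it exits early with a readable message
INPUT=/path/to/reads.fastq
if [ ! -f "$INPUT" ]; then
    echo "Input file not found: $INPUT" >&2
    exit 1
fi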
Here is another exercise: what do you think went wrong with the job, based on the error log below?
Walltime limit exceeded! This job needed more walltime. You should have been able to spot this in the email as well; below is the email I received for this job, and there is a line in it saying the same thing.
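The usual fix is to resubmit with a larger walltime request in the job script; the 24-hour figure below is just an illustration, so check your queue's limits first:

# see the walltime limits of the available queues
qstat -q

# then request more time in the job script
#PBS -l walltime=24:00:00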
It’s more complicated than that
The examples above were straightforward and easy to debug. There will be cases (most of the time, honestly) when you are not sure what the error is, or even what the error log is saying. In those cases, try these quick steps:
- Search the error log for the word "Error", read that line, and see if you can figure it out. If you are not sure, try a quick Google search for that line together with the program name (see the sketch after this list).
- Look up the exit_status number from the email (only if it's not 0) together with the program name on Google.
- Still not sure? Send the HPC support team an email, or email us at firstname.lastname@example.org.
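For the first two steps, something like the following is usually enough; the file name and job ID are the hypothetical ones from earlier in this post:

# list every line mentioning an error, with line numbers and two lines of context
grep -in -A 2 "error" quast.e244419

# recover the exit status if you no longer have the email
qstat -f 244419 | grep exit_status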
What to include in the email asking for help
- The program you are running, with its version
- The compute cluster you are running the program on
- The job script and the job logs (both the .e and .o files), plus any other outputs you have handy
- If you can easily share your input files you used, that would be a big help for us too!
This will help the support team narrow down the error and run some tests themselves before sending you the fix. In some cases, the error may be unique to your dataset, and they may ask you to share your data with them securely.
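When you write in, it also helps to bundle everything into a single archive so nothing gets lost along the way; a sketch using the hypothetical file names from this post:

# pack the job script, both job logs, and the program's own log together
tar czf quast-job-244419.tar.gz quast.sh quast.e244419 quast.o244419 quast.log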