Batch Jobs

Batch processing means running a predefined script on cluster resources non-interactively, with the cluster scheduler deciding when the script runs based on resource availability. In this computing paradigm, you write down the commands you want executed in sequence and drop your script into a bin; the cluster then runs it as early as possible given the resources you have requested. This approach is useful when you know in advance what you want to do, or have an established workflow, and do not need to interact directly with programs at a command line. It is also useful when you want to run the same steps over a large number of parameter values (see Job Arrays).

Writing A Batch Job Script

Below is an example of a simple batch script that runs a series of Unix shell commands. Namely, it runs the echo command, which prints the text "Hello World" out to the screen, the sleep command, which pauses execution for a configurable number of seconds (10 seconds in the example), and the date command, which prints out the current date and time.

At the beginning of the batch script, you'll notice a number of lines that begin with #SBATCH. #SBATCH is a directive: the shell treats the line as a comment, but the batch scheduler reads it to learn what resources you want your job to have and how you want your job to run. After #SBATCH, you can use any of the options described here to customize the job's execution environment. For a list of popular options, see the table below the code sample. The job below requests 1 CPU, 1 hour of time on the cluster, 1 GB of memory per CPU, and 1 NVIDIA GeForce GTX 1080 Ti GPU.

Simple batch script

#!/bin/sh
#
# Simple "Hello World" submit script for Slurm.
#
#SBATCH --job-name=HelloWorld
#SBATCH -c 1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1gb
#SBATCH --gres=gpu:gtx1080:1

echo "Hello World"
sleep 10
date

# End of script

Common Batch Options

Option	Description
--job-name	A human-readable name for your job.
-c [NUMBER]	Number of CPU cores to use (where NUMBER is an integer value).
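--time [TIME]	The maximum time the job is allowed to run (its wall-clock limit), for example 1:00:00 for one hour. A two-field value such as 1:00 is read as minutes and seconds.
--mem-per-cpu [NUMBER]	The memory the job will use per CPU core (where NUMBER is an integer value). The default unit is megabytes; a unit suffix can be added, as in the 1gb used in the example above.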
--gres=gpu:[NUMBER]	The number of GPUs you wish to request (where NUMBER is an integer value).
--gres=gpu:[GPU TYPE]:[NUMBER]	If you need a specific type of GPU, you can request that your jobs only be run using a particular GPU make/model. Possible GPU types include gtx1080 and gtx2080. NUMBER is an integer value.
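
The --time value is easy to misread, because Slurm interprets a two-field value as minutes and seconds rather than hours and minutes. The sketch below converts a time value to seconds so that you can check a request before submitting it. The function name slurm_time_to_seconds is a hypothetical helper, not part of Slurm, and it deliberately skips the forms that start with a day count (such as days-hours).

```shell
#!/bin/sh
# slurm_time_to_seconds is a hypothetical helper, not part of Slurm.
# It covers only the time forms without a day count:
#   MM        minutes
#   MM:SS     minutes and seconds
#   HH:MM:SS  hours, minutes, and seconds
slurm_time_to_seconds() {
    case "$1" in
        *-*) echo "day-based forms are not handled in this sketch" >&2
             return 1 ;;
    esac
    # awk treats each colon-separated field as a decimal number,
    # so values with leading zeros such as 08 are handled correctly.
    echo "$1" | awk -F: '
        NF == 1 { print $1 * 60 }
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

slurm_time_to_seconds 1:00      # prints 60 (one minute)
slurm_time_to_seconds 1:00:00   # prints 3600 (one hour)
```

A request for one hour therefore needs all three fields, as in --time=1:00:00.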

Submitting A Batch Job Script

Now that we have written our batch script, we can drop it into the metaphorical waiting bin by using the sbatch command, which uses the syntax sbatch [PATH TO SCRIPT].

Running the job

$ sbatch hello-world.sh
Submitted batch job 268

After our job is submitted, it is assigned a numerical code so that we can keep track of it. This numerical code is referred to as the job ID. In this case, our job was assigned a job ID of 268.
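
If you want to reuse the job ID in later commands, one option is to pull it out of the confirmation message. The sketch below does this for the sample message shown above; parse_job_id is a hypothetical helper name, not a Slurm command, and it assumes the message keeps the "Submitted batch job [JOB ID]" form.

```shell
#!/bin/sh
# parse_job_id is a hypothetical helper, not part of Slurm.
# It reads sbatch's confirmation message on standard input and prints
# the job ID, which is the last whitespace-separated field.
parse_job_id() {
    awk '/^Submitted batch job/ { print $NF }'
}

# Sample message copied from the example above. On a cluster you would
# pipe the real command instead:  sbatch hello-world.sh | parse_job_id
job_id=$(echo "Submitted batch job 268" | parse_job_id)
echo "Job ID: $job_id"
```

The job ID stored in job_id can then be used to build the name of the job's output file or to look the job up in the queue.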

Querying the Status of A Batch Job Script

By running the squeue command, we can see the status of our job, as well as all other jobs that are currently queued or running on the cluster. By default, squeue prints a table with the job ID, the partition the job was submitted to, the job's name, the user who submitted the job, its state (ST), how long the job has been running, how many nodes the job uses, and the node(s) that the job is running on (or, if the job is still waiting in the queue, the reason it has not started). Job state codes are described in more detail here; the most common codes are R (running), CG (completing), CD (completed), PD (pending), and CA (cancelled). Below, the first squeue call shows that the job was assigned to the node named ax02 and is running. In the second call the table is empty because the job has finished and left the queue.

Checking the job's status

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               268     burst HelloWor   aa3301  R       0:04      1 ax02
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
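
The two-letter state codes can be hard to remember at first. As a small illustration, the sketch below takes a row of squeue's default output, picks out the ST column, and translates the common codes listed above into words. describe_state is a hypothetical helper name, and the row is the sample from this page rather than live output.

```shell
#!/bin/sh
# describe_state is a hypothetical helper, not part of Slurm.
# It translates the most common squeue state codes into words.
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        PD) echo "pending" ;;
        CA) echo "cancelled" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Sample row copied from the squeue output above.
row="               268     burst HelloWor   aa3301  R       0:04      1 ax02"

# In the default layout, ST is the fifth whitespace-separated column.
state=$(echo "$row" | awk '{ print $5 }')
echo "Job 268 is $(describe_state "$state")"
```

On a busy cluster the full table can be long; squeue also accepts -u [USER] to list only one user's jobs and -j [JOB ID] to list a single job.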

Reading Job Output

Jobs generate output files containing the text that would have been printed to the terminal had you run the batch script's commands interactively. The default name of the output file is slurm-[JOB ID].out, and by default it is written to the directory you were in when you ran sbatch. Below, we review the output using the cat command (which prints the contents of a file to the terminal window):

Reading the output

$ cat slurm-268.out
Hello World
Tue Apr  2 10:49:14 EDT 2019
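
Because the default file name follows a fixed pattern, a small wrapper can find the output from the job ID alone. The sketch below is one way to do that. show_job_output is a hypothetical helper name, and it assumes the job used the default output name and that you are in the directory where you ran sbatch.

```shell
#!/bin/sh
# show_job_output is a hypothetical helper, not part of Slurm.
# It prints the default output file for a job ID, or reports that the
# file does not exist yet (for example, because the job is still pending).
show_job_output() {
    file="slurm-$1.out"
    if [ -f "$file" ]; then
        cat "$file"
    else
        echo "No output file found: $file" >&2
        return 1
    fi
}

# Usage, from the directory where sbatch was run:
#   show_job_output 268
```

If you prefer a different file name, sbatch's --output option accepts a file name pattern in which %j stands for the job ID, for example --output=hello-%j.out.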