SLURM User Guide

This guide provides an introduction to the SLURM job scheduler and its use on the Ganesha c2b2 cluster.

Introduction

Jobs run in batch mode, without user intervention. The typical workflow is:

  • Logging into the login node (link unavailable)

  • Preparing a job script that specifies the work to be done and required computing resources

  • Submitting the job to the queue

  • Optionally logging out while the job runs

  • Returning to collect the output data once the job is complete


The rest of this guide explains how to submit and monitor jobs with SLURM. It covers:

  • Essential SLURM commands

  • Creating a job submission script

  • Understanding SLURM partitions

  • Submitting jobs to the queue

  • Monitoring job progress

  • Canceling jobs from the queue

  • Setting environment variables

  • Managing job dependencies and job arrays

Commands

The following table summarizes the most commonly used SLURM commands:

Command    Description
sbatch     Submits a job script for execution.
sinfo      Displays the status of SLURM-managed partitions and nodes, with options for filtering, sorting, and formatting the output.
squeue     Shows the status of jobs, with options for filtering, sorting, and formatting. By default, it lists running jobs in priority order, followed by pending jobs in priority order.
srun       Runs a parallel job on the cluster, or launches a job step within an existing allocation.
scancel    Cancels a pending or running job.
sacct      Provides accounting information for active or completed jobs.
salloc     Allocates resources for an interactive job, so you can run tasks in real time with manual input.
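The commands above usually appear together in one submit-and-check cycle. The sketch below assumes the Example 1 script (MyHelloBatch.slurm) is in the current directory; parse_jobid is a hypothetical helper, not a SLURM command, and the guard lets the script exit harmlessly on a machine without SLURM.

```bash
#!/bin/bash
# Sketch: submit a job, check on it, and optionally cancel or audit it.
# Assumes MyHelloBatch.slurm (Example 1 in this guide) is in the current directory.

# Hypothetical helper: sbatch --parsable prints "jobid" or "jobid;cluster";
# keep only the job ID.
parse_jobid() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    jobid=$(parse_jobid "$(sbatch --parsable MyHelloBatch.slurm)")
    echo "Submitted job $jobid"
    sinfo                    # partition and node status
    squeue -j "$jobid"       # pending (PD) or running (R)?
    # scancel "$jobid"       # uncomment to cancel the job
    # Once the job has finished, query its accounting record:
    # sacct -j "$jobid" --format=JobID,JobName,State,Elapsed
else
    echo "SLURM commands not found; run this on the login node." >&2
fi
```

Using --parsable avoids scraping the human-readable "Submitted batch job ..." message, which makes the job ID easy to reuse in later commands.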

Each SLURM command has a manual (man) page with detailed documentation. For example, to read the page for sinfo, type:

man sinfo

Submission script

A submission script is a shell script that describes the computing tasks to perform, including the application, its input/output, and the resources it requires (e.g., CPUs, memory). A basic example is a single-node job with the following specifications:

  • Uses 1 node

  • Requests 8 tasks (8 CPUs, at the default of 1 CPU per task)

  • Runs a single-process application

  • Has a maximum runtime of 1.5 hours

  • Has the job name "test" (the script itself is saved as MyHelloBatch.slurm)

  • Sends email notifications to the user when the job starts, ends, or aborts

Example 1: job running on a single node

```bash
#!/bin/bash
# MyHelloBatch.slurm
#
#SBATCH -J test                 # Job name, any string
#SBATCH -o job.%j.out           # Name of stdout output file (%j = job ID)
#SBATCH -N 1                    # Total number of nodes requested
#SBATCH -n 8                    # Total number of tasks requested (1 CPU per task by default)
#SBATCH -t 01:30:00             # Run time (hh:mm:ss) - 1.5 hours
#SBATCH --mail-user=UNI@cumc.columbia.edu   # use only a Columbia address
#SBATCH --mail-type=ALL         # send email alerts on all events

module load anaconda/3.0        # load the module(s) needed by your program
python hello.py                 # your program
```

A submission script begins with #!/bin/bash, indicating that it is a bash script. Comments start with #, while lines beginning with #SBATCH are directives that tell SLURM which resources the job needs. All #SBATCH directives must appear at the top of the script, before any executable commands: sbatch stops reading directives at the first line that is neither blank nor a comment. The script requests resources with options such as:

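#SBATCH -N n or #SBATCH --nodes=n: specifies the number of compute nodes (1 in this case)

#SBATCH -n n or #SBATCH --ntasks=n: specifies the number of tasks; each task gets 1 CPU by default (8 in this case)

#SBATCH -t T or #SBATCH --time=T: sets the maximum walltime (hh:mm:ss format)

#SBATCH -J "name" or #SBATCH --job-name="name": assigns a job name

#SBATCH -o file or #SBATCH --output=file: names the standard output file; %j is replaced by the job ID

#SBATCH --mail-user=<email_address>: sets the address that receives email notifications

#SBATCH --mail-type=<type>: sets when notifications are sent (BEGIN, END, FAIL, REQUEUE, or ALL)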
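Because sbatch stops reading #SBATCH directives at the first executable line, a directive placed after a command is ignored. The hypothetical helper below (check_sbatch_order is an illustrative name, not a SLURM tool) is a minimal sketch for catching that mistake before submitting.

```bash
#!/bin/bash
# Hypothetical helper: report #SBATCH lines that appear after the first
# executable command, where sbatch no longer reads them.
# Returns 0 if the order is fine, 1 otherwise.
check_sbatch_order() {
    local script="$1" line trimmed n=0 seen_cmd=0 status=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        trimmed="${line#"${line%%[![:space:]]*}"}"   # drop leading whitespace
        if [[ $line == "#SBATCH"* ]]; then
            if (( seen_cmd )); then
                echo "line $n: directive after first command, ignored by sbatch: $line"
                status=1
            fi
        elif [[ -n $trimmed && $trimmed != "#"* ]]; then
            seen_cmd=1   # first non-blank, non-comment line
        fi
    done < "$script"
    return "$status"
}
```

For example, check_sbatch_order MyHelloBatch.slurm && sbatch MyHelloBatch.slurm submits the job only when every directive sits above the first command.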

The rest of the script is ordinary bash describing the work the job performs. By default, the job starts in the directory from which it was submitted, with the same environment variables as the user. In this example, the script loads the anaconda/3.0 module and runs python hello.py.
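To see the default behavior described above in practice, the hypothetical job below (envcheck is an illustrative name) prints a few variables that SLURM sets inside a job, together with the current directory and the user's HOME. When it is submitted with sbatch, the submit directory and the working directory should match; the ${VAR:-...} fallbacks are an assumption added so the same file also runs under plain bash.

```bash
#!/bin/bash
#SBATCH -J envcheck             # hypothetical job name
#SBATCH -o job.%j.out           # stdout file (%j = job ID)
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:05:00

# Print where and how the job is running. SLURM sets the SLURM_* variables
# inside a job; outside SLURM the fallback text is printed instead.
print_job_info() {
    echo "Job ID:      ${SLURM_JOB_ID:-not running under SLURM}"
    echo "Submit dir:  ${SLURM_SUBMIT_DIR:-unset}"
    echo "Working dir: $(pwd)"
    echo "Nodes:       ${SLURM_JOB_NODELIST:-unset}"
    echo "Home dir:    ${HOME:-unset}"
}

print_job_info
```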

Example 2: job running on multiple nodes

To execute an MPI application across multiple nodes, we need to modify the submission script to request additional resources and specify the MPI execution command:

```bash
#!/bin/bash
# MyHelloBatch.slurm
#
#SBATCH -J test                 # Job name, any string
#SBATCH -o job.%j.out           # Name of stdout output file (%j = job ID)
#SBATCH -N 2                    # Total number of nodes requested
#SBATCH --ntasks-per-node=16    # Number of tasks (MPI processes) per node
#SBATCH -t 01:30:00             # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -p highmem              # Partition (queue) name; use gpu for the GPU node
#SBATCH --mail-user=UNI@cumc.columbia.edu   # use only a Columbia address
#SBATCH --mail-type=ALL         # send email alerts on all events

module load openmpi4/4.1.1      # load the module(s) needed by your program
mpirun myMPICode                # your program
```

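The multi-node script is similar to the single-node one. The key addition is #SBATCH --ntasks-per-node=m, which reserves m cores on each node (one per MPI process) so the program can run in parallel across the -N nodes requested. This example also selects the highmem partition with -p and launches the program with mpirun.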
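To confirm that Example 2 spread its tasks as requested, lines like the following could be added to the job script before mpirun. This is a sketch: the defaults 2 and 16 simply mirror Example 2's request so the arithmetic also runs outside a job, and the per-node count only runs inside a SLURM job.

```bash
#!/bin/bash
# Expected number of MPI processes: nodes x tasks per node.
# Inside the job SLURM sets both variables (SLURM_NTASKS_PER_NODE only when
# --ntasks-per-node is given); the defaults mirror Example 2 (-N 2, 16 per node).
nodes=${SLURM_JOB_NUM_NODES:-2}
per_node=${SLURM_NTASKS_PER_NODE:-16}
total=$(( nodes * per_node ))
echo "Expecting $total MPI processes ($nodes nodes x $per_node tasks per node)"

# Inside a job, run one hostname per task and count the tasks on each node.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    srun hostname | sort | uniq -c
fi
```

With Example 2's request, the counts should read 16 for each of the two nodes.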

Interactive Jobs

An interactive job gives you a command-line interface, so you can interact with an application or debug issues in real time instead of just running a script. To start an interactive job, use the salloc command. Once the job begins, you get a command-line prompt on one of the assigned compute nodes, where you can run commands that use the allocated resources directly.


Unless specified otherwise, jobs started with salloc are allocated 1 CPU and 4 GB of memory. If your job needs more resources, request them with additional salloc options; for example, you can allocate 2 nodes, each with 4 CPUs and 4 GB of memory.

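One way to express the 2-node request above is sketched below, under two assumptions: --ntasks-per-node=4 relies on the default of 1 CPU per task to give 4 CPUs per node, and --mem sets the memory per node. The arguments are printed first as a dry run; salloc itself is only called from an interactive terminal where it is installed.

```bash
#!/bin/bash
# 2 nodes, 4 CPUs (4 single-CPU tasks) and 4 GB of memory on each node.
salloc_args=(--nodes=2 --ntasks-per-node=4 --mem=4G)
echo "salloc ${salloc_args[*]}"

# Start the interactive allocation only from a terminal on a SLURM login node.
if command -v salloc >/dev/null 2>&1 && [ -t 0 ]; then
    salloc "${salloc_args[@]}"
fi
```

If the 4 CPUs per node are meant for one multithreaded process rather than 4 separate tasks, --ntasks-per-node=1 --cpus-per-task=4 is the alternative. Inside the allocation, srun hostname prints one host name per task, which shows how the tasks are spread across the nodes; type exit to release the allocation.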

GPU Jobs

GPUs are not assigned to a job unless you request them explicitly. Use one of the following options with sbatch, srun, or salloc when requesting resources:

Option               Description
--gres               Generic resources required per node (for GPUs, e.g. --gres=gpu:1).
--gpus               GPUs required for the whole job.
--gpus-per-node      GPUs required per node. Equivalent to the --gres option for GPUs.
--gpus-per-socket    GPUs required per socket. Requires the job to specify a sockets-per-node count (--sockets-per-node).
--gpus-per-task      GPUs required per task. Requires the job to specify a task count (e.g. --ntasks). This is the recommended option for GPU jobs.
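As a quick sanity check that a GPU request was honored, a job can count the devices it sees. This sketch assumes the cluster's GPU setup exports CUDA_VISIBLE_DEVICES to each task (SLURM's usual behavior for GPU GRES); count_visible_gpus is a hypothetical helper name, and the partition name gpu comes from the examples in this guide.

```bash
#!/bin/bash
# Count the GPUs visible to the current task from the comma-separated
# CUDA_VISIBLE_DEVICES list; prints 0 when the variable is unset or empty.
count_visible_gpus() {
    local devs="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devs" ]; then
        echo 0
        return
    fi
    local IFS=,
    set -- $devs          # split the device list on commas
    echo "$#"
}

# From the login node: one task with one GPU in the gpu partition,
# listing the GPU(s) that task can see.
if command -v srun >/dev/null 2>&1; then
    srun --partition=gpu --ntasks=1 --gpus-per-task=1 nvidia-smi -L
fi
```

In a batch script that requests --gpus-per-task=1, echo "GPUs: $(count_visible_gpus)" should report 1.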

A simple example that requests a GPU and prints GPU information is shown below. You can download the example (link unavailable).

```bash
#!/bin/bash
## Resource request
#SBATCH --job-name=CudaJob
#SBATCH --output=result.out     ## output file; add %j for the job ID (e.g. result.%j.out); default is slurm-[jobID].out
#SBATCH --partition=gpu         ## partition(s) to run in (comma-separated)
#SBATCH --ntasks=1              ## number of tasks (analyses) to run
#SBATCH --gpus-per-task=1       ## number of GPUs per task
#SBATCH --mem-per-gpu=100M      ## memory per allocated GPU
#SBATCH --time=0-00:10:00       ## time limit (days-hours:minutes:seconds)

## Load the CUDA module
module load cuda

## Compile the CUDA program with the nvcc compiler; stop if compilation fails
nvcc -o stats stats.cu || exit 1

## Run the program
srun ./stats
```
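The CUDA script gives its time limit as days-hours:minutes:seconds, while Examples 1 and 2 use hh:mm:ss. As a sketch for checking what a --time value amounts to, the hypothetical helper below (slurm_time_to_seconds is an illustrative name) converts the formats that sbatch accepts into seconds.

```bash
#!/bin/bash
# Hypothetical helper: convert a SLURM --time value to seconds.
# Forms handled: MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS.
slurm_time_to_seconds() {
    local t="$1" days=0 h=0 m=0 s=0
    local -a f
    if [[ $t == *-* ]]; then
        days=${t%%-*}                 # part before the dash is days
        t=${t#*-}
        IFS=: read -r -a f <<< "$t"
        case ${#f[@]} in
            1) h=${f[0]} ;;
            2) h=${f[0]}; m=${f[1]} ;;
            3) h=${f[0]}; m=${f[1]}; s=${f[2]} ;;
            *) return 1 ;;
        esac
    else
        IFS=: read -r -a f <<< "$t"
        case ${#f[@]} in
            1) m=${f[0]} ;;
            2) m=${f[0]}; s=${f[1]} ;;
            3) h=${f[0]}; m=${f[1]}; s=${f[2]} ;;
            *) return 1 ;;
        esac
    fi
    # 10# forces base 10 so fields such as "08" are not read as octal.
    echo $(( ((10#$days * 24 + 10#$h) * 60 + 10#$m) * 60 + 10#$s ))
}
```

For example, slurm_time_to_seconds 0-00:10:00 prints 600 and slurm_time_to_seconds 01:30:00 prints 5400.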