SLURM User Guide

Note: This guide provides an introduction to the SLURM job scheduler and its application on the c2b2 clusters. The clusters comprise:

8 compute nodes, each with 20-core processors and 128 GB of memory
Some nodes with 192 cores and 1.5 TB of memory
1 GPU node with 2 NVIDIA L40S GPU cards
1 GPU node with an NVIDIA GH200 Grace Hopper Superchip (ARM architecture), 1 GPU, and 570 GB of memory

This guide will help you get started with using SLURM on these clusters.

Introduction

Jobs are executed in batch mode, without user intervention. The typical process involves:

Logging in to the login node
Preparing a job script that specifies the work to be done and required computing resources
Submitting the job to the queue
Optionally logging out while the job runs
Returning to collect the output data once the job is complete
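
The steps above can be sketched as a short shell session on the login node. This is a minimal sketch under stated assumptions, not a site-specific recipe: `parse_job_id` and `submit_and_wait` are hypothetical helper names, the 30-second polling interval is arbitrary, and the script name `MyHelloBatch.slurm` and output pattern `job.<jobID>.out` are taken from the example job script in this guide.

```shell
#!/bin/bash
# Workflow sketch: submit a job, wait for it to leave the queue, read its output.

# Extract the job ID from sbatch's confirmation line,
# which has the form "Submitted batch job <id>".
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Hypothetical helper: submit a script, poll the queue, then print the output file.
submit_and_wait() {
    local script=$1 job_id
    job_id=$(sbatch "$script" | parse_job_id) || return 1
    [ -n "$job_id" ] || return 1
    echo "Submitted job $job_id"
    # "squeue -h" omits the header, so empty output means the job
    # is no longer pending or running.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep 30
    done
    # Assumes the script sets "#SBATCH -o job.%j.out".
    cat "job.${job_id}.out"
}

# Only attempt a submission where SLURM is installed and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f MyHelloBatch.slurm ]; then
    submit_and_wait MyHelloBatch.slurm
fi
```

Polling is optional. Because the job runs without user intervention, you can equally log out after `sbatch` returns and read the output file later.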

This guide introduces submitting and monitoring jobs with SLURM. It covers the following topics:

Essential SLURM commands
Creating a job submission script
Understanding SLURM partitions
Submitting jobs to the queue
Monitoring job progress
Canceling jobs from the queue
Setting environment variables
Managing job dependencies and job arrays

Commands

The following table summarizes the most commonly used SLURM commands:

Command	Description
sbatch	Submits a job script for execution.
sinfo	Displays the status of SLURM-managed partitions and nodes, with customizable filtering, sorting, and formatting options.
squeue	Shows the status of jobs, with options for filtering, sorting, and formatting, defaulting to priority order for running and pending jobs.
srun	Runs a parallel job on the cluster.
scancel	Cancels a pending or running job.
sacct	Provides accounting information for active or completed jobs.
salloc	Allocates resources for an interactive job, letting users run commands in real time with manual input.
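
A few typical invocations of the commands in the table may make them more concrete. This is an illustrative sketch, not an exhaustive reference: `run` is a hypothetical wrapper that only prints the command when SLURM is not installed, the job ID `12345` is a placeholder, and the output columns chosen for `sinfo` and `sacct` are just one reasonable selection.

```shell
#!/bin/bash
# Hypothetical helper: run a command if it is installed,
# otherwise print what would have been run.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "[dry-run] $*"
    fi
}

# Read-only queries, safe to try at any time.
run sinfo                              # partitions and node states
run sinfo -N -o "%N %P %c %m %G"       # per node: name, partition, CPUs, memory (MB), generic resources
run squeue -u "$USER"                  # only your own jobs
run sacct -u "$USER" --format=JobID,JobName,Partition,State,Elapsed

# Commands that submit, cancel, or allocate are left as comments
# so that nothing is changed by accident:
#   sbatch MyHelloBatch.slurm          # submit a job script
#   scancel 12345                      # cancel job 12345 (placeholder ID)
#   srun -N 1 -n 1 hostname            # run one task on one node
#   salloc -N 1 -n 1 -t 00:30:00       # 30-minute interactive allocation
```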

Each SLURM command is documented in detail in its manual (man) page, which can be opened by typing, for example:

`man sinfo`

Submission script

A submission script is a shell script that outlines the computing tasks to be performed, including the application, input/output, and resource requirements (e.g., CPUs, memory). A basic example is a job that needs a single node with the following specifications:

Uses 1 node
Requests 8 tasks (by default, one CPU core per task)
Has a maximum runtime of 1.5 hours
Is named "test"
Runs in the highmem partition
Sends email notifications to the user when the job starts, stops, or aborts

Example: job running on a single node

#!/bin/bash
#MyHelloBatch.slurm
#
#SBATCH -J test                           # Job name, any string
#SBATCH -o job.%j.out                     # Name of stdout output file (%j=jobId)
#SBATCH -N 1                              # Total number of nodes requested
#SBATCH -n 8                              # Total number of tasks (CPU cores) requested
#SBATCH -t 01:30:00                       # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -p highmem                        # Partition (queue) name. Specify gpu for the GPU node.
#SBATCH --mail-user=UNI@cumc.columbia.edu # use only Columbia address
#SBATCH --mail-type=ALL                   # send email alert on all events
 
module load anaconda/3.0                  # Load the module(s) needed by your program
python hello.py                           # Run your program
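
Before submitting a script like the one above, it can help to check which directives SLURM will read and to let the scheduler validate them. The following is a minimal sketch: `list_directives` is a hypothetical helper, and the file name `MyHelloBatch.slurm` is assumed from the comment at the top of the example script. Note that `#SBATCH` lines are only honored if they appear before the first executable command in the script.

```shell
#!/bin/bash
# Script to inspect; the default name is an assumption taken from the example.
script=${1:-MyHelloBatch.slurm}

# Hypothetical helper: print the "#SBATCH" directive lines of a job script.
list_directives() {
    grep '^#SBATCH' "$1"
}

if [ -f "$script" ]; then
    list_directives "$script"
    if command -v sbatch >/dev/null 2>&1; then
        # Validate the script and estimate a start time; no job is submitted.
        sbatch --test-only "$script"
        # Real submission, which prints "Submitted batch job <id>":
        # sbatch "$script"
    fi
fi
```

Options given on the `sbatch` command line take precedence over the directives in the script, so a one-off change such as `sbatch -t 02:00:00 MyHelloBatch.slurm` does not require editing the file.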