Slurm Overview

sbatch: error: Batch job submission failed: Invalid account or account/partition combination

If you see the above error when submitting a job in any of the tutorials below, please send an email to rc@zi.columbia.edu, as this is a problem with some individual researcher accounts that we can fix on our end.  This usually only happens with new individual accounts.

Description

Slurm is an open source job scheduler that brokers interactions between you and the many computing resources available on Axon.  It allows you to share resources with other members of your lab and other labs using Axon while enforcing policies on cluster usage and job priority.  You may be familiar with other tools with similar functionality in scientific computing, such as Sun Grid Engine or HTCondor.  These tools come from a long tradition of utilities centered around batch processing, although Axon's Slurm configuration allows you to create interactive sessions as well.  Their main purpose is to distribute finite computing resources fairly according to pre-defined rules and algorithms.  Operating systems (such as Linux, Windows, and Mac OS X) use the same or similar algorithms to allow you to run multiple applications at once on a single computer (in contrast to a cluster of multiple machines like Axon).

Scheduler Configuration and Rules

While the scheduling algorithms that Slurm can use and their underlying implementations are ultimately handled by the developers of Slurm (SchedMD), there are some aspects of scheduler behavior that we can modify through our particular configuration.  The following sections outline the scheduler behaviors that we have customized for Axon.

Job Routing

In Slurm terminology, a partition is a set of nodes that a job can be scheduled on.  A node can belong to more than one partition, and each partition can be configured to enforce different resource limits and policies.

On Axon, there are three main partitions that you may encounter:

  • If you have purchased whole nodes for the cluster, you will have a lab-specific partition consisting of these nodes.  This partition uses the same name as your lab account (e.g., ctn, nklab, or issa).
  • The shared partition, which consists of a node that Zuckerman Research Computing has purchased using its own budget.  This partition is used for Axon's rental tier.
  • The burst partition, which consists of all nodes in the cluster.
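
If you are curious which partitions are visible to your account, the standard sinfo command will list them along with their nodes and current state.  sinfo is a generic Slurm command rather than something Axon-specific, so the exact columns may differ slightly from other clusters you have used:

sinfo        # list the partitions you can see, their nodes, and their state
sinfo -s     # condensed summary with one line per partition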

A job can be submitted to multiple partitions and will execute on the first partition that has all of the resources requested available.  Each partition has a numerical weight associated with it (illustrated in the table below).  When a job is being evaluated by Slurm for scheduling, Slurm checks higher-weighted partitions for free resources first.

Partition                 Weight
Lab-Specific Partition    20
Shared                    10
Burst                     5
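
If you would like to check how a particular partition is configured (including its priority weighting), the standard scontrol command will print its settings.  The exact field names vary between Slurm versions, so treat the following as a general sketch rather than Axon-specific output:

scontrol show partition shared     # replace "shared" with any partition name you have access to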

For labs using the rental tier, jobs are only submitted to the shared partition by default.  This means that jobs will only run within shared and will be unable to burst to use unused capacity on the cluster.

For labs with full node ownership, jobs are submitted to both the lab-specific partition and the burst partition by default.  This means that when your job is being scheduled, Slurm first checks your own nodes for available resources before considering resources available within the burst partition.  If a researcher has access to multiple lab-specific partitions, Slurm will start the job in the first lab-specific partition in which the requested resources are available.  You can remove the burst partition from the list of partitions that are evaluated by typing "sburst off" before submitting a job, which ensures that the job is only submitted to your lab-specific partition(s).  The burst partition can be toggled back on by typing "sburst on".  The Preemption section below explains why you may want to remove the burst partition from the set of partitions your job is submitted to.

The sburst command works by setting environment variables within your shell session.  These environment variables override partition requests defined within a batch script; this means that adding the line "#SBATCH -p [PARTITION_NAME]" to your batch script will not work on Axon.  Instead, you can override sburst's environment variables by running "sbatch -p [PARTITION_NAME]" rather than plain "sbatch" when you submit your job.  The source code for sburst can be inspected by the curious (using one's favorite text editor) at /etc/profile.d/sburst.sh.
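
To make this concrete, the following sketch shows how you might inspect what sburst has set in your current session and then force a job onto one specific partition.  The batch script name is a placeholder, and the exact variable names sburst uses are best confirmed by reading /etc/profile.d/sburst.sh:

env | grep -i partition                          # show any partition-related environment variables sburst has set
sbatch -p [PARTITION_NAME] my_batch_script.sh    # the -p flag overrides those environment variables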

For reference, here is the relevant bit describing this behavior from the sbatch manual:

Note that environment variables will override any options set in a batch script, and command line options will override any environment variables.

Scheduling Algorithm

This section is a deep dive into the technical details of how Axon makes scheduling decisions.  It is provided for those interested in such details; you do not need to read it to learn how to use Axon.


By default, the Slurm scheduler can use one of two algorithms to schedule jobs on the cluster:

The backfill algorithm, which is the default on many other Slurm clusters, attempts to schedule low-priority jobs as long as they do not prevent higher-priority jobs from starting at their expected start times.  One problem with this algorithm is that it is highly dependent upon how diligent other cluster users are in setting the --time= parameter in their submission scripts.  Essentially, it focuses on optimizing start and end times, but it assumes that everyone accurately estimates the time their jobs will take and encodes this estimate in their job scripts.  Another problem with this algorithm is that when a job is scheduled to execute on a node, the job is allocated the entire node (even if it is not using all of the resources on that node).  In short, although this algorithm can be an ideal optimization for scheduling jobs under certain circumstances, the granularity it uses when considering resources is very coarse, and it makes assumptions about how job scripts are written that may not hold true.
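
To make this concrete, the --time= parameter mentioned above is set in the header of a batch script.  A minimal sketch follows; the job name, time estimate, and workload are placeholders that you should replace with your own:

#!/bin/bash
#SBATCH --job-name=my_training_run     # placeholder job name
#SBATCH --time=02:00:00                # wall-time estimate in hh:mm:ss (a days-hours form such as 1-12:00:00 also works)

python train.py                        # placeholder workload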

The priority queue algorithm uses a global queue of all jobs that have been submitted to the cluster.  This queue is ordered by the priority score assigned to each job, with higher-priority jobs in front.  For the job at the front of the queue, Slurm decides how to assign resources before removing the job from the queue and placing it on a node.  This means that, under circumstances where the highest-priority job cannot be allocated resources, the job at the front of the queue can occasionally block Slurm from trying to find resources for lower-priority jobs.  This can make Slurm unresponsive.

On our cluster, we use the priority queue algorithm, but we have added some custom modifications that cause it to behave like a multi-level feedback queue.  This means that any given unscheduled job cannot block other unscheduled jobs from being evaluated for resource allocation, making the cluster more responsive.  Because our setup is ultimately based upon Slurm's built-in priority queue algorithm, Slurm also does not allocate an entire node to each job and can deal with resources at a more fine-grained level.  With this configuration, Slurm also makes no assumptions about job start and end times.  In brief, our setup gives us fine-grained allocation of resources (i.e., the ability to request discrete CPU cores and GPUs), responsiveness, and robustness to inaccurate job time estimates.
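
If you want to see the priority scores that Slurm has computed for pending jobs under this scheme, the standard sprio utility prints them.  Like other generic Slurm commands, its exact columns depend on the cluster's configuration:

sprio                   # priority factors for all pending jobs
sprio -u [YOUR_UNI]     # priority factors for your own pending jobs only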

Preemption

The burst partition is primarily intended as a means to prevent cluster resources from sitting idle.  It is similar in principle to AWS Spot Instances in that it allows you to make use of unused cluster capacity on other labs' nodes, with the caveat that your job will be canceled and requeued if a member of the lab that owns the node your job is running on submits a job to their own node.  This means that, by default, your job may be interrupted and restarted from the beginning if it is running on a node that does not belong to your group.  Thus, you may want to keep your job out of burst if it is not robust to being stopped and requeued (e.g., if it does not use checkpointing) or if you simply want to guarantee that it is never preempted in this manner.
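
A simple way to do this is to turn bursting off for your shell session before submitting, as described in the Job Routing section above.  The batch script name below is just a placeholder:

sburst off                  # submit only to your lab-specific partition(s) from now on
sbatch my_batch_script.sh   # this job stays on your own nodes and cannot be preempted
sburst on                   # re-enable bursting for later submissions if desired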

Default Resource Allocations

For each job you submit using Slurm, you can optionally specify the amount of resources necessary for the job (we will describe how to do this later).  If you do not explicitly specify a resource allocation when you submit your job, Slurm will use a set of default values.  Since the majority of jobs on Axon are training/inference sessions for machine learning, these default values are inspired by recommendations made by Tim Dettmers in his post "A Full Hardware Guide to Deep Learning".  This post's essential points can be summarized as follows:

  • The amount of RAM requested should match the amount of RAM present in the GPU (with perhaps a little extra RAM on top).
  • The ratio of CPUs:GPUs is dependent upon your preprocessing strategy. If you are preprocessing while you train, you should use 2 physical cores (4 logical cores) per GPU. If you preprocess your data before training, you should use 1 physical core (2 logical cores) per GPU. Adding more than 2 logical cores per GPU for the latter preprocessing strategy isn't worth it.
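
As a hypothetical worked example: a job that uses 2 GPUs with 16 GB of memory each and preprocesses data while training would, under these guidelines, request a little over 32 GB of RAM (2 × 16 GB) and 8 logical cores (4 logical cores per GPU).  If the same data were preprocessed before training, 4 logical cores (2 per GPU) would suffice.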

The default memory allocations for each partition, which are based partly on the recommendations above (and partly on the amount of resources available on each lab's nodes), are listed in the table below:

Partition                           Memory per Logical Core (GB)
ctn                                 10.5
nklab                               3.8
issa                                2.34
shared                              4
Partitions not explicitly listed    7

Axon currently enforces an upper limit of 16 GB of RAM per logical core.  This limit is intended to prevent jobs from using more memory than is available on a node.
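
If you would rather request resources explicitly than rely on these defaults, the standard sbatch options below will do so.  This is a minimal sketch; the particular values, job name, and workload are placeholders that you should adjust to your own job and your lab's hardware:

#!/bin/bash
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --cpus-per-task=4     # 4 logical cores per GPU (preprocessing while training)
#SBATCH --mem-per-cpu=4G      # 4 GB per logical core, well under the 16 GB cap
#SBATCH --job-name=train      # placeholder job name

python train.py               # placeholder workload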

In the current version of Slurm installed on Axon (18.08), the default number of logical cores allocated for all jobs across all partitions is 1 core; this option is not configurable.  A newer version of Slurm (19.05), which we plan to install during our next upgrade cycle, will allow us to define a default number of logical cores allocated per GPU requested.  In keeping with Dettmers's recommendations, we plan on a default of 2 logical cores per GPU, with the assumption that many (if not most) jobs will preprocess data before training.  If this assumption is wrong and the other preprocessing strategy is more widely used, we can switch to a default of 4 logical cores per GPU.

Job Time Limits

By default, all jobs have a 10-day time limit imposed upon them.  For the shared and burst partitions, the time limit is 5 days.  It is highly recommended that you use checkpointing functionality so that jobs needing more than 10 days can resume from the point where they left off if they are interrupted.  Some links to checkpointing implementations within PyTorch and Tensorflow/Keras can be found here:

Quality of Service

If you have a job that you believe is not being scheduled quickly enough, you can use the sboost command to place your job at the top of the pending job queue for evaluation:

sboost [JOB NUMBER]

Common Commands

Checking for Available Resources on the Cluster

The sfree command is an Axon-specific command that will display the total number of currently available resources on Axon:

(base) [jsp2205@axon ~]$ sfree
Node Name    CPUs Free    GPUs Free    Memory Free    Scratch Space Free
--------------------------------------------------------------------------
ax01         6 / 48       3 / 8        115 / 187      1787 / 1787
ax02         8 / 48       4 / 8        147 / 187      1787 / 1787
ax03         44 / 56      2 / 8        8 / 125        893 / 893
ax04         46 / 56      3 / 8        31 / 125       893 / 893
ax05         42 / 56      1 / 8        3 / 125        893 / 893
ax06         0 / 80       2 / 8        107 / 187      1862 / 1862
ax07         4 / 80       8 / 8        10 / 187       1862 / 1862
ax08         6 / 48       2 / 7        133 / 187      893 / 893
ax09         4 / 80       4 / 8        63 / 187       931 / 931
ax10         4 / 80       2 / 8        90 / 187       931 / 931

sfree is written specifically for Axon.  Other clusters that you may use, such as Habanero or Terremoto, will not have this command.

Monitoring disk usage in your individual home directory

You can use the homeusage command described on the Managing Files and Data page to easily determine your individual storage usage.

homeusage is written specifically for Axon.  Other clusters that you may use, such as Habanero or Terremoto, will not have this command.

Monitoring disk usage across your lab's account

You can use the labusage command described on the Managing Files and Data page to easily determine your lab's storage usage.

labusage is written specifically for Axon.  Other clusters that you may use, such as Habanero or Terremoto, will not have this command.

Seeing the list of jobs that have been submitted to Axon

The squeue command will print out a list of all current jobs that have been submitted to Axon:

squeue

The default output of squeue includes several important bits of information.  Namely, it shows:

  • Job ID: Each job submitted to Slurm is given a corresponding integer value to uniquely identify it.  This integer can be used as input to other Slurm commands to modify the behavior of the job in some way.
  • Partition: The partition(s) that the job is submitted to.
  • Name: A human-readable name that you have given to the job as part of your batch script.  If you are using Jupyter notebooks with sjupyter (see Jupyter Notebooks) this will be "jupyter".
  • User: The UNI of the person who submitted a job.
  • State (ST): An indicator of whether a job is running (R) or still waiting to be scheduled/pending (PD).
  • Time: How long a job has been running.
  • Reason: If a job is not yet running on a node, this column indicates a reason for why that is so.

squeue also has a number of options that can modify its output.  You can see a full list of these options and their descriptions by typing squeue --help.
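
A few commonly useful variations, all standard squeue options, are shown below (replace the UNI and partition name with your own):

squeue -u [YOUR_UNI]     # show only your own jobs
squeue -p shared         # show only jobs submitted to the shared partition
squeue --start           # show estimated start times for pending jobs, when Slurm can compute them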

Reading the contents of job submission scripts for jobs past and present

After you submit a batch script to Slurm, it copies the contents of that script to a private file and saves this file for when the job is allocated resources.  This means that if the batch script has been modified between the time you submitted the job and the present, your copy of the script may not be the same as the copy that Slurm uses to run the job.

To see what job script Slurm executed for a given job, you can use the following command:

sscript [JOB ID]

This will open up the job script in a pager program, which you can use to scroll up and down the text (using the up and down arrows or j and k).  To leave this pager program, type q.

Note that sscript can only show you the job scripts for jobs that you yourself have submitted.  If you try to read a job script that was not submitted by you, you may be presented with the following error message:

The archived job script for job ID [JOB ID] is not readable.
This could be because it was not submitted by you or because it has not yet been updated for read access in the script archive.
If the latter, please try again in a minute.
Exiting now.


sscript is written specifically for Axon.  Other clusters that you may use, such as Habanero or Terremoto, will not have this command.

Cancelling Jobs

The scancel command takes a job number as input (which you can get from squeue) and cancels that job.

scancel [JOB ID]
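
scancel also accepts standard filtering options, which can be handy for cleaning up many jobs at once (replace the UNI with your own):

scancel -u [YOUR_UNI]          # cancel all of your jobs
scancel -u [YOUR_UNI] -t PD    # cancel only your pending (not yet running) jobs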


Tutorials

Axon-Specific

The pages listed below provide a basic overview of how to use Slurm, with an emphasis on use cases common to Axon.  Links to other good tutorials produced by external groups for other Slurm clusters are provided in the section at the bottom of this page.

Additional Tutorials