Warning
sbatch: error: Batch job submission failed: Invalid account or account/partition combination

If you see the above error when submitting a job in any of the tutorials below, please send an email to rc@zi.columbia.edu, as this is a problem with some individual researcher accounts that we can fix on our end. It usually only happens with new individual accounts.

...

For labs with full node ownership, jobs are submitted to both the lab-specific partition and the burst partition by default.  This means that when your job is being scheduled, Slurm first checks your own nodes for available resources before considering resources available within the burst partition.  If a researcher has access to multiple lab-specific partitions, Slurm will start the job in the first lab-specific partition in which the requested resources are available.  You can remove the burst partition from the list of partitions that are evaluated by typing "sburst off" before submitting a job, which ensures that the job is only submitted to the lab-specific partition(s) (see the sketch after this paragraph).  The burst partition can be toggled back on by typing "sburst on".  The next section explains why you may want to remove the burst partition from the set of partitions your job is submitted to.
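A minimal sketch of this toggle is shown below.  The sburst and sbatch commands are the ones described above; the job script name my_job.sh is a hypothetical placeholder.

    sburst off           # exclude the burst partition from subsequent submissions
    sbatch my_job.sh     # this job is only considered for your lab-specific partition(s)
    sburst on            # re-enable the burst partition for later submissions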

...

On our cluster, we use a priority queue algorithm, but we have added some custom modifications that cause this algorithm to behave like a multi-level feedback queue algorithm.  This means that any given unscheduled job cannot block other unscheduled jobs from being evaluated for resource allocation, making the cluster more responsive.  Because our setup is ultimately based upon Slurm's built-in priority queue algorithm, Slurm does not allocate an entire node to each job and is capable of dealing with resources at a more fine-grained level.  With this configuration, Slurm also makes no assumptions about job start and end times.  In brief, our setup gives us fine-grained allocation of resources (i.e., the ability to request discrete CPU cores and GPUs), responsiveness, and robustness to inaccurate job time estimates.
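As a concrete illustration of this fine-grained allocation, the sbatch directives below request a specific number of CPU cores and a single GPU rather than an entire node.  This is only a sketch: the job name, resource amounts, time estimate, and application name are hypothetical placeholders.

    #!/bin/bash
    #SBATCH --job-name=example     # hypothetical job name
    #SBATCH --cpus-per-task=4      # request 4 discrete CPU cores
    #SBATCH --gres=gpu:1           # request a single GPU
    #SBATCH --mem=16G              # request 16 GB of memory
    #SBATCH --time=1-00:00:00      # wall-clock estimate of 1 day

    ./my_app                       # hypothetical application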

Preemption

The burst partition is primarily intended as a means to prevent cluster resources from sitting idle.  It is similar in principle to AWS Spot Instances in that it allows you to make use of unused cluster capacity by running on other labs' nodes, with the caveat that your job will be canceled and requeued if a member of the lab that owns the node your job is running on submits a job to their own nodes.  This means that, by default, your job may be interrupted and restarted from the beginning if it is running on a node that doesn't belong to your group.  Thus, you may want to turn burst off if the job you are running is not robust to being stopped and requeued (e.g., if it does not use checkpointing) or if you simply want to guarantee that it is never preempted in this manner.
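If you do keep the burst partition enabled, one common pattern is to make the job itself tolerant of being preempted.  The sketch below assumes that the application can resume from its own checkpoints and that Slurm delivers SIGTERM to the batch script with a short grace period before the job is killed and requeued; my_app, its flag, and that grace-period behavior are assumptions rather than documented guarantees of this cluster.

    #!/bin/bash
    #SBATCH --requeue                       # allow Slurm to put the job back in the queue after preemption

    # If Slurm sends SIGTERM before requeueing the job, forward it to the
    # application so it can write a final checkpoint.
    trap 'kill -TERM "$APP_PID"' TERM

    ./my_app --resume-from-checkpoint &     # hypothetical app that resumes from its own checkpoints
    APP_PID=$!
    wait "$APP_PID"                         # returns early if the trap fires...
    wait "$APP_PID"                         # ...so wait again while the app finishes checkpointing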

Default Resource Allocations

...

Job Time Limits

By default, all jobs have a 10 day time limit imposed upon them.  For the shared and burst partitions, the time limit is 5 days.  It is highly recommended that you use checkpointing functionality to ensure that jobs that need more than 10 days can resume from the point where they left off if they are interrupted.  Some links to checkpointing implementations within PyTorch and TensorFlow/Keras can be found here:
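Separately, if you know roughly how long your job needs, you can request an explicit wall-clock limit with sbatch's standard --time option; in the sketch below, the 2 day value and the script name my_job.sh are placeholders.

    sbatch --time=2-00:00:00 my_job.sh     # request a 2 day limit, well under the 10 day default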

Quality of Service

If you have a job that you believe is not being scheduled quickly enough, you can use the sboost command to place your job at the top of the pending job queue for evaluation:

...
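Before and after boosting a job, you may also want to see how Slurm currently views it.  The standard Slurm commands below can help; the job ID 123456 is a placeholder.

    squeue -j 123456 --start     # show the job's estimated start time, if Slurm has computed one
    sprio -j 123456              # show the factors contributing to the job's scheduling priority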