Processes and Jobs (sbatch, srun, and more)
In computing, a process is an instance of a computer program that is being executed. It contains the program code and its current activity. A process is normally launched by invoking it by the name of the executable (compiled code) associated with it, either directly at the Unix shell prompt, or within a shell script.
...
Warning
When a user ignores that recommendation and executes processes that are compute-intensive, longer than momentary, and especially require multiple cores, the login node becomes overloaded, preventing other users from doing their regular work. In such cases we typically terminate the processes and notify the user, with a request to run the processes within jobs instead. Please be aware that the cluster is a shared resource, and cooperate with us in keeping computing activity on the head node to a minimum.
If you need to run a CPU-intensive process, please start an interactive job as described below:
https://columbiauniversity.atlassian.net/wiki/display/rcs/Insomnia+-+Submitting+Jobs#Insomnia-SubmittingJobs-InteractiveJobs
Jobs can request compute resources on a per-core basis or a per-node basis.
...
#SBATCH --nodes=1
You may specify a memory requirement when requesting use of a node.
#SBATCH --mem=510G    # Standard nodes have approximately 510G of total memory available
...
To explicitly request a "standard node" with 400 GB RAM, you may specify the mem400 feature.
#SBATCH -C mem400
If your job requires more than the standard 512 GB, then you may optionally add a constraint to request one of the cluster's high memory nodes, each of which has 1024 GB (i.e. 1 TB) of memory. The feature to request is "mem1024".
#SBATCH -C mem1024
or
#SBATCH --constraint=mem1024
Please keep in mind that the above directives only secure the corresponding type of node; they do not ensure that all of its memory is available to the job (even if you specify --exclusive usage). By default, 5,800 MB is allocated to each core.
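Putting the directives above together, a minimal batch script requesting a high memory node might look like the following sketch. The account, job name, and program are placeholders, not actual site values:

```shell
#!/bin/sh
# Hypothetical job script: request one full high memory (1 TB) node.
#SBATCH --account=<ACCOUNT>    # replace with your account
#SBATCH --job-name=HighMemJob
#SBATCH -C mem1024             # constrain the job to a high memory node
#SBATCH --exclusive            # reserve the whole node
#SBATCH --mem=900gb            # request the memory explicitly as well
#SBATCH --time=1:00:00

./my_program                   # placeholder for your application
```

Note that without the --mem line, the job would still land on a high memory node but would only be allocated the default 5,800 MB per core.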
Interactive Jobs
Interactive jobs allow user interaction during their execution. They deliver to the user a new shell from which applications can be launched.
...
srun --pty -t 0-01:00 -C mem1024 -A <ACCOUNT> /bin/bash
If a node is available, it will be picked for you automatically, and you will see a command line prompt on a shell running on it. If no nodes are available, your current shell will wait.
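Additional resources can be requested on the same srun line. As a sketch (the account is a placeholder, and the specific core and memory numbers are only illustrative):

```shell
# Hypothetical interactive job: 2 hours, 4 cores, 20 GB of memory
srun --pty -t 0-02:00 -c 4 --mem=20G -A <ACCOUNT> /bin/bash
```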
...
Directive | Short Version | Description | Example | Notes |
---|---|---|---|---|
--account=<account> | -A <account> | Account. | #SBATCH --account=stats | |
--job-name=<job name> | -J <job name> | Job name. | #SBATCH -J DiscoProject | |
--time=<time> | -t <time> | Time required. | #SBATCH --time=10:00:00 | The maximum time allowed is five days. |
--mem=<memory> | | Memory required per node. | #SBATCH --mem=16gb | |
--mem-per-cpu=<memory> | | Memory per CPU. | #SBATCH --mem-per-cpu=5G | |
--constraint=mem1024 | -C mem1024 | Request a high memory (1024 GB RAM) node. | #SBATCH -C mem1024 | |
--cpus-per-task=<cpus> | -c <cpus> | CPU cores per task. | #SBATCH -c 1 | Nodes have a maximum of 24 cores. |
--nodes=<nodes> | -N <nodes> | Nodes required for the job. | #SBATCH -N 4 | |
--array=<indexes> | -a <indexes> | Submit a job array. | #SBATCH -a 1-4 | See below for discussion of job arrays. |
--mail-type=<ALL,BEGIN,END,FAIL,NONE> | | Send email job notifications. | #SBATCH --mail-type=ALL | |
--mail-user=<email_address> | | Email address for notifications. | #SBATCH --mail-user=me@email.com | |
Walltime
The walltime is specified with "-t" flag. For example:
#SBATCH -t 10:00:00
That walltime format translates to 10 hours (00 minutes and 00 seconds).
If you want to request just one hour of walltime, request 1:00:00.
Acceptable time formats in Slurm scheduler are: "minutes", "minutes:seconds", "hours:minutes:seconds",
"days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
The maximum time allowed is five days.
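As a sketch, each of the following lines requests the same five-hour limit in a different accepted format (only one -t directive should appear in a real script):

```shell
#SBATCH -t 300          # minutes
#SBATCH -t 5:00:00      # hours:minutes:seconds
#SBATCH -t 0-05         # days-hours
#SBATCH -t 0-05:00      # days-hours:minutes
#SBATCH -t 0-05:00:00   # days-hours:minutes:seconds
```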
Memory Requests
There are two ways to ask for memory, and they are mutually exclusive. You can ask either for
1) memory per cpu
or
2) memory per node
If you do not specify the memory requirement, by default you get 5,800 MB per CPU.
At this writing (February 2024), Insomnia has 25 standard nodes with 512 GB of memory (approximately 510 GB usable).
Insomnia's 9 high memory nodes have 1 TB of memory each. They are otherwise identical to the standard nodes.
For example,
--mem-per-cpu=5gb
Minimum memory required per allocated CPU. If you request 32 cores (one node), you will get 160 GB of memory on both a standard node and a high memory node.
If you specify the real memory required per node:
--mem=160gb
You will get the same.
However, if you specify
#SBATCH --exclusive
#SBATCH -C mem1024
#SBATCH --mem=900gb
you will get 900 GB on a high memory node.
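As a sketch, these two requests are equivalent for a 32-core allocation, but only one of the two forms may be used in a given job:

```shell
# Option 1: per-CPU memory (32 cores x 5 GB = 160 GB total)
#SBATCH -c 32
#SBATCH --mem-per-cpu=5gb

# Option 2: per-node memory
#SBATCH -c 32
#SBATCH --mem=160gb
```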
Job Arrays
Multiple copies of a job can be submitted by using a job array. The --array option can be used to specify the job indexes Slurm should apply.
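A job array script typically uses the SLURM_ARRAY_TASK_ID environment variable, which Slurm sets to each task's index, to vary the work per task. The following is a sketch; the account, job name, program, and input file naming are all hypothetical:

```shell
#!/bin/sh
# Hypothetical job array: four copies of the job, indexes 1 through 4
#SBATCH --account=<ACCOUNT>
#SBATCH --job-name=ArrayExample
#SBATCH --time=1:00:00
#SBATCH --array=1-4

# Each task sees its own index in SLURM_ARRAY_TASK_ID
./my_program input_${SLURM_ARRAY_TASK_ID}.txt
```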
...
See the Slurm Quick Start Guide for a more in-depth introduction on using the Slurm scheduler.
Lustre Background and Basics
The Insomnia cluster utilizes Lustre, which is a robust file system that consists of servers and storage. A Metadata Server (MDS) tracks metadata (for example, ownership and permissions of a file or directory). Object Storage Servers (OSSs) provide file I/O services for Object Storage Targets (OSTs), which host the actual data storage. An OST is typically a single disk array. A notional diagram of a Lustre file system is shown in Figure 1, with one MDS, three OSSs, and two OSTs per OSS for a total of six OSTs. A Lustre parallel file system achieves its performance by automatically partitioning data into chunks, known as “stripes,” and writing the stripes in round-robin fashion across multiple OSTs. This process, called "striping," can significantly improve file I/O speed by eliminating single-disk bottlenecks.
For jobs generating a large number of I/O requests that are relatively small in size (e.g., jobs using Python, bedtools, Ancestry HMM, or even Matlab), you can use a fine-grained approach with Lustre Progressive File Layout.
$ mkdir workdir001
$ lfs setstripe -E 4M -c 1 -E 128M -c 4 -E -1 -c -1 workdir001
In the above example we first create a directory called "workdir001" and then set a striping policy where files smaller than 4 MB get one stripe, files between 4 MB and 128 MB get a stripe count of 4, and files larger than 128 MB get the default of -1, i.e., a stripe count determined automatically by Lustre. Any files you create under workdir001 will inherit the new striping policy. Subdirectories created before the striping policy is set will not inherit it; subdirectories created after the striping policy is set will inherit it.
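In other words, the order of operations matters. A sketch, assuming the directory lives on a Lustre mount:

```shell
mkdir workdir001
mkdir workdir001/early    # created BEFORE the policy is set: does not inherit it
lfs setstripe -E 4M -c 1 -E 128M -c 4 -E -1 -c -1 workdir001
mkdir workdir001/late     # created AFTER the policy is set: inherits it
```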
The lfs getstripe Command
The "lfs getstripe" command reports the stripe characteristics of a file or directory.
Syntax:
$ lfs getstripe [--stripe-size] [--stripe-count] [--stripe-index] <directory|filename>
Example:
$ lfs getstripe MyDir
MyDir
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
The output shows that files created in the directory MyDir will be stored using one stripe of 1048576 bytes (1 MB) per block unless explicitly striped otherwise before writing. The stripe_offset (also known as stripe index) of -1 means that each file will have an OST placement determined automatically by Lustre.
The setstripe command does not have a recursive option; however, you can use the find command. Here is an example that sets the stripe with a Progressive File Layout.
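One possible sketch of that approach, reusing the same PFL layout from the earlier example and assuming workdir001 and its subdirectories already exist on Lustre:

```shell
# Apply the PFL layout to every directory in the tree
find workdir001 -type d -exec lfs setstripe -E 4M -c 1 -E 128M -c 4 -E -1 -c -1 {} \;
```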
...