About HPC Cluster
The new HPC cluster at C2B2 is a Linux-based (Rocky 9.4) compute cluster consisting of 62 Dell servers, 2 head nodes, a virtualized pool of login (submit) nodes, and 8 Weka storage nodes. It is designed for running compute-intensive AI workloads.
The clusters comprise:
SLURM commands offer detailed documentation and guidance through their manual (man) pages, which can be accessed with the man command, for example man sbatch.
Submission script
A submission script is a shell script that outlines the computing tasks to be performed, including the application, input/output, and resource requirements (e.g., CPUs, memory). A basic example is a job that needs a single node with the following specifications:
Uses 1 node
Runs a single-process application (the script requests 8 CPUs)
Has a maximum runtime of 1.5 hours
Is named "test" (the script file itself is MyHelloBatch.slurm)
Code Block | ||
---|---|---|
| ||
#!/bin/bash
#MyHelloBatch.slurm
#
#SBATCH -J test # Job name, any string
#SBATCH -o job.%j.out # Name of stdout output file (%j=jobId)
#SBATCH -N 1 # Total number of nodes requested
#SBATCH -n 8 # Total number of cpu requested
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#SBATCH --mail-user=UNI@cumc.columbia.edu # use only Columbia address
#SBATCH --mail-type=ALL # send email alert on all events
module load anaconda/3.0 # load the module(s) needed by your program
python hello.py # run your program |
A submission script begins with #!/bin/bash, indicating it's a Linux bash script. Comments start with #, while #SBATCH lines specify job scheduling resources for SLURM. Note that #SBATCH directives must be placed at the top of the script, before any other commands. The script requests resources, such as:
#SBATCH -N n or #SBATCH --nodes=n : specifies the number of compute nodes (only 1 in this case)
#SBATCH -t T or #SBATCH --time=T: sets the maximum walltime (hh:mm:ss format)
#SBATCH -J "name" or #SBATCH --job-name="name": assigns a job name
#SBATCH --mail-user=<email_address>: sets the address that receives email notifications
#SBATCH --mail-type=<type>: sets notification options (BEGIN, END, FAIL, REQUEUE, or ALL)
The final section of the script is ordinary bash that carries out the job's work. By default, the job starts in the directory it was submitted from, with the same environment variables as the submitting user. In this example, the script simply runs python hello.py.
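The same request can also be written with the long-form options listed above. The sketch below writes such a script to a file and submits it only when sbatch is found on the PATH. The file name hello_long.slurm is a hypothetical choice made for illustration; the module name and hello.py are carried over from the example above.

```shell
#!/bin/bash
# Write a long-form variant of the example script.
# "hello_long.slurm" is a hypothetical file name chosen for this sketch.
script="hello_long.slurm"
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=test            # same as -J test
#SBATCH --output=job.%j.out        # same as -o job.%j.out
#SBATCH --nodes=1                  # same as -N 1
#SBATCH --ntasks=8                 # same as -n 8
#SBATCH --time=01:30:00            # same as -t 01:30:00
module load anaconda/3.0           # module name taken from the example above
python hello.py
EOF

# Submit only where SLURM is available; elsewhere just report what was written.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$script"
else
    echo "sbatch not found; wrote $script without submitting"
fi
```

Short and long option forms can be mixed freely in one script; the long forms tend to be easier to read when a script is revisited later.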
Example 2: job running on multiple nodes
To execute an MPI application across multiple nodes, we need to modify the submission script to request additional resources and specify the MPI execution command:
Code Block |
---|
#!/bin/bash
#MyHelloBatch.slurm
#
#SBATCH -J test # Job name, any string
#SBATCH -o job.%j.out # Name of stdout output file (%j=jobId)
#SBATCH -N 2 # Total number of nodes requested
#SBATCH --ntasks-per-node=16 # set the number of tasks (processes) per node
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -p highmem # Queue name. Specify gpu for the GPU node.
#SBATCH --mail-user=UNI@cumc.columbia.edu # use only Columbia address
#SBATCH --mail-type=ALL # send email alert on all events
module load openmpi4/4.1.1 # load the module(s) needed by your program
mpirun myMPICode # run your program |
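As a quick check of the request above, the total number of MPI tasks is the node count multiplied by the tasks per node. The sketch below does that arithmetic. It reads SLURM_NNODES and SLURM_NTASKS_PER_NODE when run inside a job; the fallback values of 2 and 16 are taken from the script above and are an assumption added so the snippet also runs outside SLURM.

```shell
#!/bin/bash
# Inside a job SLURM sets these variables; outside a job, fall back to the
# values requested in the example script (-N 2, --ntasks-per-node=16).
nodes="${SLURM_NNODES:-2}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-16}"

# Total MPI tasks = nodes x tasks per node.
total_tasks=$(( nodes * tasks_per_node ))
echo "nodes=$nodes tasks_per_node=$tasks_per_node total_tasks=$total_tasks"
```

With the values from the example script this gives 32 tasks, well within the 192 cores per node described for this cluster.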
20 compute nodes, each with 192 cores and 768 GB of memory.
2 nodes with 192 cores and 1.5 TB of memory.
40 GPU nodes, each featuring 2 NVIDIA L40S GPU cards.
1 GPU node with an NVIDIA GH200 Superchip (ARM architecture), 1 GPU, and 570 GB of memory.
This guide will help you get started with using SLURM on these clusters.
Introduction
Jobs are executed in batch mode, without user intervention. The typical process involves:
Logging in to the login node
Preparing a job script that specifies the work to be done and required computing resources
Submitting the job to the queue
Optionally logging out while the job runs
Returning to collect the output data once the job is complete
Essential SLURM commands
Creating a job submission script
Understanding SLURM partitions
Submitting jobs to the queue
Monitoring job progress
Canceling jobs from the queue
Setting environment variables
Managing job dependencies and job arrays
Commands
The following table summarizes the most commonly used SLURM commands:
Command | Description |
---|---|
sbatch | Submits a job script for execution. |
sinfo | Displays the status of SLURM-managed partitions and nodes, with customizable filtering, sorting, and formatting options. |
squeue | Shows the status of jobs, with options for filtering, sorting, and formatting. By default, it lists running and pending jobs in priority order. |
srun | Runs a parallel job on the cluster. |
scancel | Cancels a pending or running job. |
sacct | Provides accounting information for active or completed jobs. |
salloc | Allocates resources and starts an interactive job, allowing users to run tasks in real time with manual input. |
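A typical sequence of these commands might look like the sketch below. It is a dry run by default: the run helper (a hypothetical function written for this example, not part of SLURM) only prints each command, and executes it only when DRY_RUN=0 is set and the command is installed. The job ID 12345 is a placeholder; in practice, use the ID that sbatch prints on submission.

```shell
#!/bin/bash
# Dry-run helper: always prints the command; executes it only when
# DRY_RUN=0 and the command is actually installed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

me="${USER:-$(id -un)}"
job_script="MyHelloBatch.slurm"   # script name used in this guide
job_id="${JOB_ID:-12345}"         # placeholder job ID (assumption)

run sinfo                          # partitions and node states
run sbatch "$job_script"           # submit the job script
run squeue -u "$me"                # list your pending and running jobs
run sacct -j "$job_id" --format=JobID,JobName,State,Elapsed   # accounting info
run scancel "$job_id"              # cancel the job
```

Keeping the dry run as the default avoids accidentally cancelling a real job while experimenting with the commands.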
Standard compute nodes, each with 192 CPU cores and 768 GB of memory.
One NVIDIA GH200 Superchip node, with a 72-core ARM CPU and 1 H100 GPU. Because the CPU and GPU are tightly integrated, the very high bandwidth between them lets the GPU use the 480 GB of CPU memory in addition to its own 96 GB of on-chip memory. Although CPU memory is not as fast as GPU memory, the 900 GB/s memory bandwidth allows Large Language Model (LLM) applications to use all 576 GB effectively.
The primary network for all compute nodes and Weka storage is HDR100, a 100 Gbps low-latency InfiniBand fabric. Since each node has 192 CPU cores, the need to go across nodes over the network is greatly reduced. If a large MPI application still needs multiple nodes, the low-latency InfiniBand fabric greatly reduces network bottlenecks. Each node also has a 25 Gbps IP network for applications that must use IP networking. Additionally, a set of virtual machines running on Proxmox provides a pool of login nodes for user access to this and other systems. The login nodes are the primary gateways to the HPC cluster. They are not meant for heavy-duty interactive work; for that, users can start an interactive shell session on a compute node, which has more compute resources than a login node. See below.
Storage for the cluster is provided exclusively by our Weka system, with over 1 PB of capacity on fast all-NVMe drives. WekaFS further boosts performance by distributing the load across 8 servers in parallel. Applications can use GPU-Direct RDMA technology, which bypasses the kernel network stack to access data from Weka directly via Mellanox ConnectX-6 NICs. See https://docs.nvidia.com/cuda/gpudirect-rdma/ for more details about this technology.
If you're experiencing issues with the cluster, please reach out to dsbit_help@cumc.columbia.edu for support. To facilitate a quick and precise response, be sure to include the following in your email:
Your Columbia University ID (UNI)
Job ID numbers (if your issue is related to a specific job)
Info |
---|
This HPC cluster exclusively accepts MC credentials for authentication. However, to access the cluster, you also need an active HPC account with C2B2. If you don't have an account, please reach out to dsbit_help@cumc.columbia.edu to request one. |
Getting Access
To get access to this HPC cluster, every research group needs to establish a PI Account by signing an MoU-SLA agreement, which can be downloaded as DSBIT-MOU-SLA.pdf. This document provides further details about modalities, rights and responsibilities, charges, etc.
Logging In
You will need to use SSH to access the cluster. Windows users can use PuTTY, Cygwin, or MobaXterm. macOS users can use the built-in Terminal application.
Users log in to the cluster's login node at hpc.c2b2.columbia.edu using MC credentials:
$ ssh <UNI>@hpc.c2b2.columbia.edu |
Interactive login to Compute Node
All users access the HPC resources via a login node. These nodes are meant for basic tasks like editing files or creating new directories, not for heavy workloads. If you need to perform heavy-duty tasks interactively, you must open an interactive shell session on a compute node using SLURM's srun command, as in the example below. See the SLURM User Guide link in the side navigation to learn more about SLURM.
srun --pty -t 1:00:00 /bin/bash |
Interactive login on GPU node
srun -p gpu --gres=gpu:L40S:1 --mem=8G --pty /bin/bash |
Interactive login on GPU node with memory and time limit
srun -n 1 --time=01:00:00 -p gpu --gres=gpu:L40S:1 --mem=10G --pty /bin/bash |
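Once an interactive session starts, it can be useful to confirm that the shell is really running on a compute node inside a SLURM allocation. The sketch below prints a few indicators. The function name session_info is hypothetical, and the assumption that CUDA_VISIBLE_DEVICES is set for GPU allocations depends on how the site has configured SLURM, so treat that line as a hint rather than a guarantee.

```shell
#!/bin/bash
# Print basic facts about the current shell session.
# session_info is a hypothetical helper written for this example.
session_info() {
    echo "Host: $(uname -n)"
    # SLURM_JOB_ID is set by SLURM inside a job or interactive allocation.
    echo "SLURM job ID: ${SLURM_JOB_ID:-not set}"
    # Often set by SLURM for GPU allocations (site-dependent assumption).
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
    # List GPUs if the NVIDIA tools are installed on this node.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || true
    fi
}

session_info
```

If the job ID shows as not set, the shell is most likely still on a login node and the srun command has not started a session yet.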