...

  • 20 compute nodes, each with 192 CPU cores and 768 GB of memory.

  • 2 nodes with 192 cores and 1.5 TB of memory.

  • 40 GPU nodes, each featuring 2 NVIDIA L40S GPU cards, 192 CPU cores, and 768 GB of memory.

  • One NVIDIA GH200 Superchip with a 72-core ARM CPU and one H100 GPU. Thanks to its tightly coupled design, the very high bandwidth between the CPU and the GPU allows the GPU to use the 480 GB of CPU memory in addition to its own 96 GB of on-chip memory. Although the CPU memory is not as fast as the GPU's, its 900 GB/s memory bandwidth allows a Large Language Model (LLM) application to access the full 576 GB of memory quite effectively (see the sketch after this list).
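
The practical upshot is that a single allocation can be shared by the CPU and the GPU. Below is a minimal CUDA sketch of this pattern; the buffer size is illustrative, and on the GH200 the same code also works when the allocation exceeds the GPU's 96 GB, because pages can reside in the 480 GB of CPU memory.

    // Minimal CUDA sketch: one buffer visible to both the CPU and the GPU.
    // The size shown is small for illustration; on the GH200 the same code
    // also works when the allocation is larger than the GPU's 96 GB, since
    // pages may live in the 480 GB of CPU memory behind the NVLink-C2C link.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, size_t n, float factor) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        size_t n = 1ULL << 28;                           // 256 Mi floats (1 GB), illustrative
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));     // single shared address space
        for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // initialized by the CPU

        size_t threads = 256;
        size_t blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(data, n, 2.0f);       // GPU updates the same buffer
        cudaDeviceSynchronize();

        printf("data[0] = %.1f\n", data[0]);             // CPU sees the GPU's result
        cudaFree(data);
        return 0;
    }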

The primary network for all the compute nodes and the Weka storage is HDR100, a 100 Gbps low-latency InfiniBand fabric. Since each node has 192 CPU cores, the need to go across nodes over the network is greatly reduced. If a large MPI application still needs to span multiple nodes, the low-latency InfiniBand fabric largely removes network bottlenecks. Each node also has a 25 Gbps IP network for applications that must use an IP network. Additionally, a set of login nodes running on Proxmox virtualization provides a pool of virtual login nodes for user access to this and other systems. The login nodes are the primary gateways to the HPC cluster. They are not meant for heavy-duty interactive work; for that, users can start an interactive shell session on a compute node, which has far more compute resources than a login node. See below.

Storage for the cluster is provided exclusively by our Weka filesystem, a large bank of very fast all-NVMe drives with over 1 PB of capacity. WekaFS further boosts performance by distributing the load across 8 servers in parallel. Applications can use GPUDirect RDMA technology, which bypasses the kernel-based network stack to access data from Weka directly via Mellanox ConnectX-6 NICs. See https://docs.nvidia.com/cuda/gpudirect-rdma/ for more details about this technology.
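
From application code, this direct data path is most commonly consumed through NVIDIA's GPUDirect Storage (cuFile) API, which builds on GPUDirect RDMA. The following is a minimal sketch under that assumption; it presumes libcufile is installed, the file path is hypothetical, and error handling is mostly omitted.

    // Minimal GPUDirect Storage (cuFile) sketch: read file data straight
    // into GPU memory, bypassing the host page cache and bounce buffers.
    // Illustrative build line: nvcc gds_read.cu -lcufile
    #define _GNU_SOURCE                                   /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cufile.h>

    int main(void) {
        const char  *path = "/weka/dataset/sample.bin";   /* hypothetical path */
        const size_t size = 1 << 20;                      /* 1 MiB, illustrative */

        int fd = open(path, O_RDONLY | O_DIRECT);         /* bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }

        cuFileDriverOpen();                               /* initialize the GDS driver */

        CUfileDescr_t descr = {0};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);            /* register the open file */

        void *devPtr = NULL;
        cudaMalloc(&devPtr, size);                        /* destination GPU buffer */
        cuFileBufRegister(devPtr, size, 0);               /* pin it for DMA */

        /* DMA from storage into GPU memory, with no host copy in between. */
        ssize_t n = cuFileRead(handle, devPtr, size, 0 /* file offset */, 0 /* buffer offset */);
        printf("read %zd bytes into GPU memory\n", n);

        cuFileBufDeregister(devPtr);
        cuFileHandleDeregister(handle);
        cuFileDriverClose();
        cudaFree(devPtr);
        close(fd);
        return 0;
    }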

...

Interactive Login to a Compute Node

All users will access the HPC resources via a login node. These nodes are meant for basic tasks such as editing files or creating directories, not for heavy workloads. If you need to perform heavy-duty tasks interactively, you must open an interactive shell session on a compute node using SLURM's srun command, as in the example below. See the SLURM User Guide link on the side navigation to learn more about SLURM.
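
A minimal sketch of such an srun request is shown here; the partition name, resource amounts, and time limit are assumptions, so substitute the values appropriate for your job and this site:

    # Request 1 task with 8 CPU cores and 32 GB of memory for 2 hours,
    # then open an interactive bash shell on the allocated compute node.
    srun --partition=compute --nodes=1 --ntasks=1 --cpus-per-task=8 \
         --mem=32G --time=02:00:00 --pty bash -i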

...