Research Computing at C2B2
About us:
DSBIT (https://systemsbiology.columbia.edu/dsbit) is one of the Certified Information Technology Groups (CITGs) under the CUIMC CITG charter: https://www.it.cuimc.columbia.edu/about-us/certified-it-groups-citgs
Our Mission:
Provide our research community with a powerful and secure infrastructure for High-Performance Computing (HPC), big data storage, application servers, data center co-location/hosting, and desktop support services, in compliance with the CITG charter.
High-Performance Computing (HPC) Cluster at C2B2
The Center for Computational Biology and Bioinformatics (C2B2) is located at the heart of the Columbia University Irving Medical Center (CUIMC) campus. C2B2 has been managing a large Research Computing and Data Storage (RCD) infrastructure since 2007. This infrastructure includes a modern Data Center housing about one hundred 19-inch-wide, 42U-high server racks, powered by a 1 MW battery-backed Uninterruptible Power Supply (UPS) system, with a Facilities-managed air-conditioning system and a secure access system managed by Public Safety. The original construction of the facility was funded by the State of New York. The server infrastructure was supported by National Institutes of Health (NIH) grants in 2012, 2016 and 2024. The facility has a High-Performance Computing (HPC) cluster, managed by systems engineers from DSBIT under the charter of Certified IT Groups (CITG).
The HPC cluster is periodically upgraded to stay current with modern RCD standards. The most recent upgrade was completed during July-August 2024 under NIH award 1S10OD032433-01A1. Our funding proposal for the acquisition of a state-of-the-art high-performance computing cluster received an excellent review and was funded with the maximum grant allowed under the S10 program.
The HPC cluster was designed in close collaboration with the engineering teams at Dell and NVIDIA, and with input from members of the CUIMC research community. This allowed us to maximize the return on investment of the grant funds by tuning the configuration with the latest components and sub-systems that can reliably support the computational needs of our researchers for the next five years.
The main highlights of this HPC cluster are as follows:
GPUs: 80 NVIDIA L40S GPUs with 48 GB of on-board memory each. These GPUs have 76 billion transistors on the chip, only slightly below the top-of-the-line H100 GPUs, which have 80 billion. Performance-wise, the L40S sits between the highly successful A100 and the latest H100 GPUs.
As detailed in the following table, the L40S outperforms the A100 on most metrics, particularly FP32 (single-precision floating-point) math, which is most important for our workflows. Our research community feels that the L40S occupies the sweet spot of price versus performance for our workloads, and these GPUs are readily available in the market.
CPUs: 128 AMD EPYC 9654 processors, each packing 96 cores on a single chip. Given their superior pricing and core-density profile, increasing the CPU core count to over 12,000 makes good sense and will serve well our general-purpose users, whose workloads primarily utilize CPUs.
Memory: 49 TB spread across 64 compute nodes. Each compute node carries at least 768 GB of memory, while a few nodes have twice as much. Compute jobs that need very large memory can reserve a whole node through the SLURM scheduler, allocating up to the full 1.5 TB available on such a node to a single job.
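As an illustration, a large-memory job could be requested with a SLURM batch script along the following lines. This is a minimal sketch only: the partition name, time limit, and executable are placeholders rather than actual C2B2 settings, so please consult the cluster documentation for the correct values.

    #!/bin/bash
    #SBATCH --job-name=bigmem_job
    #SBATCH --nodes=1
    #SBATCH --exclusive              # reserve the entire node for this job
    #SBATCH --mem=1500G              # ~1.5 TB; only the large-memory nodes can satisfy this
    #SBATCH --cpus-per-task=96
    #SBATCH --time=24:00:00
    #SBATCH --partition=highmem      # placeholder partition name

    srun ./my_large_memory_analysis  # placeholder executable

Submitting the script with sbatch queues the job until a node with enough free memory becomes available.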
Networking: All nodes in the HPC cluster are interconnected via a private, very fast, low-latency HDR200/100 InfiniBand (IB) fabric. The cluster also has 100 Gb/s Ethernet for those applications that are unable to utilize IB technology. The cluster has multiple 10 Gb/s links to the University core network that connects most of the departments and labs, allowing easy data transfers between this HPC cluster and the instruments. We use the industry-standard GLOBUS service to move and share data securely within C2B2, across campus, with external collaborators, and with the Cloud.
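For example, a transfer from an instrument workstation to project storage on the cluster could be driven from the command line with the Globus CLI, roughly as sketched below. The endpoint UUIDs, paths, and label are placeholders for illustration, not real C2B2 endpoints.

    # One-time, browser-based authentication
    globus login

    SRC="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # placeholder: instrument/lab endpoint UUID
    DST="11111111-2222-3333-4444-555555555555"   # placeholder: C2B2 HPC storage endpoint UUID

    # Recursively copy one run's results directory to project storage on the cluster
    globus transfer "$SRC:/data/run_2024_08/" "$DST:/groups/mylab/run_2024_08/" \
        --recursive --label "instrument-to-HPC"

The same endpoints can also be used interactively through the Globus web interface.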
Storage: To match the data I/O requirements of these powerful compute nodes, we have purchased a WEKA storage appliance providing 1 PB of usable, all-NVMe storage. Currently, WEKA is among the highest-performance storage systems available in the market. This storage system also utilizes Remote Direct Memory Access (RDMA) technology, allowing the compute nodes to access data in the storage tier without suffering the traditional network and storage protocol overheads. Further, WEKA is a software-defined storage system, which makes it hardware-agnostic, flexible, and easily extensible. This new flash tier not only upgrades storage capacity and I/O performance but also improves reliability and enables quicker recovery from drive failures, thanks to greatly reduced drive rebuild times.
Our existing Isilon system, with 3 PB of storage capacity on spinning drives, will continue to serve our mass-storage needs at lower cost.
OpenHPC: The OpenHPC software stack has matured rapidly in the last few years. The open platform is not only the most popular choice at universities; unlike commercial offerings, which are becoming increasingly expensive, it also avoids vendor lock-in in the long run. Our HPC cluster is built entirely from open-source components, including Rocky Linux 9.x, Warewulf 4.x, Apptainer, SLURM, Prometheus, Grafana, XDMoD, Open OnDemand, etc.
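To give a flavor of how these pieces fit together, the sketch below shows a SLURM batch script running a containerized GPU workload with Apptainer. The container image, GPU request syntax, and training script are assumed placeholders; the actual GRES names and defaults on the cluster may differ.

    #!/bin/bash
    #SBATCH --job-name=gpu_container
    #SBATCH --gres=gpu:1             # request one GPU (site GRES naming may differ)
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00

    # Build a local .sif image once from a public registry, then reuse it:
    #   apptainer pull pytorch.sif docker://pytorch/pytorch:latest

    # --nv exposes the host NVIDIA driver and GPUs inside the container
    apptainer exec --nv pytorch.sif python train.py   # placeholder image and script

Because Apptainer images are single files, the same container can be shared among lab members on the parallel file system.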
Pricing: C2B2 is one of the COREs that operate on the CUIMC campus. All COREs are monitored and regulated by the Office of Research, which mandates that we operate on a chargeback model to recover our costs. The COREs are required to maintain balanced budgets, and the pricing of our services is reviewed annually for each Fiscal Year. The following table lists all the services offered by C2B2 and their current prices: