Containers
We support the use of Singularity/Apptainer on Axon, but not Docker (due to security concerns). Fortunately, it is possible to import Docker containers into Singularity/Apptainer.
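As a rough illustration of importing a Docker image (the image name below is just an example, not something we provide), singularity pull can convert a Docker Hub image into a SIF file:

# Pull a Docker Hub image and convert it to a Singularity image file (SIF); image name is illustrative
singularity pull pytorch.sif docker://pytorch/pytorch:latest
# The resulting pytorch.sif can then be run like any other container
singularity run pytorch.sif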
We currently have some pre-built containers on Axon:
$ ls /share/singularity/
DeepMimic.def  DeepMimic-GPU.def  DeepMimic-GPU.sif  DeepMimic.sif  mmaction2.def  mmaction2.sif  mmaction.def  mmaction.sif  tsn_latest.sif
If you need a different container, your best bet is to build it on a machine where you have root or administrator rights, and then upload it to Axon. If you can't get that to work, you can reach out to us and we can build the container for you.
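As a rough sketch of that workflow (the definition file name, hostname, and destination path below are only placeholders), the local build and upload might look like:

# On a machine where you have root, build a SIF from a definition file (names are placeholders)
sudo singularity build mycontainer.sif mycontainer.def
# Then copy the image to Axon, e.g. with scp (hostname and destination path are placeholders)
scp mycontainer.sif <your_uni>@axon:/path/on/axon/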
There are also online repositories with pre-built containers, such as the Sylabs library: https://cloud.sylabs.io/library
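For example, images from the Sylabs library can be pulled with the library:// URI scheme (the image reference below is only an illustration):

# Pull a pre-built image from the Sylabs library (image reference is just an example)
singularity pull alpine.sif library://alpine:latest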
Interactive Session Usage
You can experiment with a container by starting one in an interactive session before writing a batch job. This example uses the mmaction2 container that is available on Axon, and loads CUDA and CUDNN (the versions are just examples).
# On axon.rc, open an interactive session with 1 GPU.
srun --pty --gres=gpu:1 bash -i

# Load CUDA and CUDNN
ml load cuda/10.1.168
ml load cudnn/7.3.0

singularity run --nv /share/singularity/mmaction2.sif

# Note that if you require access to data available on Axon, you can bind an Axon path
# to be available in the container. Don't use "[" and "]", just the path:
singularity run --nv --bind [Path to data on Axon]:[Path where you would like data to be at within the container] /share/singularity/mmaction2.sif
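For instance, with the placeholders filled in (the paths here are only illustrative), the bind flag would look like:

# Mount /share/ctn/projects on Axon as /data inside the container (illustrative paths)
singularity run --nv --bind /share/ctn/projects:/data /share/singularity/mmaction2.sif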
Batch Script Usage
You can also use SLURM to submit batch jobs that run in containers. Here's an example that turns the interactive session above into a script. Paths to the data are examples only; be sure to use your actual paths. In this example, the CTN projects directory located at /share/ctn/projects is mounted as /projects within the container, so the config file lives under /share/ctn/projects during a normal Axon session but under /projects once the container is active. Note the use of singularity exec as opposed to singularity run.
#!/bin/bash
#SBATCH --job-name=mmaction   # The job name.
#SBATCH -c 16                 # The number of cores.
#SBATCH --mem-per-cpu=1gb     # The memory the job will use per cpu core.
#SBATCH --gres=gpu:1          # The number of GPUs

ml load cuda/10.1.168
ml load cudnn/7.3.0

singularity exec --nv --bind /share/ctn/projects:/projects /share/singularity/mmaction2.sif bash -c "python /mmaction2/tools/train.py /projects/config.py [optional arguments]"
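Assuming the script above is saved as, say, mmaction.sh (the filename is arbitrary), you would submit and monitor it with the standard SLURM commands:

# Submit the batch job
sbatch mmaction.sh
# Check its status in the queue
squeue -u $USER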
Containers as a module: Singularity Registry HPC (shpc)
Singularity Registry HPC (shpc) allows us to install containers as modules. It is available in the Miniforge-24.7.1-2 module. What follows is a tutorial on how to use PyTorch 22.06 inside a container with Python 3.8.13. We'll use an interactive session with srun; when we upgrade Slurm, salloc will be the preferred interactive method.
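For reference, once salloc becomes the preferred method, an equivalent interactive request would presumably look something like the sketch below; the srun command in the walkthrough that follows is what works today.

# Hypothetical salloc equivalent of the srun request used below
salloc -t 06:00:00 --gres=gpu:1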
srun --pty -t 06:00:00 --gres=gpu:1 /bin/bash

# Next load the Miniforge module that has shpc:
ml Miniforge-24.7.1-2

# Let's load the shpc modules (ml use is shorthand for module use):
module use /share/apps/Miniforge/lib/python3.12/site-packages/modules

# We'll see some new modules become available to load:
ml av

--------- /share/apps/Miniforge/lib/python3.12/site-packages/modules ---------
bids/freesurfer/V30-a43f1f/module.tcl
nvcr.io/nvidia/pytorch/22.06-py3/module.tcl
rocker/tidyverse/4.4.2/module.tcl
tensorflow/tensorflow/2.7.1-gpu/module.tcl
tensorflow/tensorflow/latest-gpu/module.tcl

# Let's load the PyTorch module:
ml nvcr.io/nvidia/pytorch/22.06-py3/module.tcl

# Next load Python from within the container:
pytorch-shell
INFO: Environment variable SINGULARITY_SHELL is set, but APPTAINER_SHELL is preferred

# Then start Python and start using PyTorch:
python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1
>>> print(torch.__version__)
1.13.0a0+340c412

# Now we can check that PyTorch is using a GPU:
>>> import torch
>>>
>>> # Step 1: Check if CUDA is available
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> print(f'Using device: {device}')
Using device: cuda
>>>
>>> # Step 2: Create a sample tensor and move it to the GPU
>>> tensor_size = (10000, 10000)  # Size of the tensor
>>> a = torch.randn(tensor_size, device=device)  # Random tensor on GPU
>>> b = torch.randn(tensor_size, device=device)  # Another random tensor on GPU
>>>
>>> # Step 3: Perform operations on GPU
>>> c = a + b  # Element-wise addition
>>>
>>> # Print the result (moving back to CPU for printing)
>>> print("Result shape (moved to CPU for printing):", c.cpu().shape)
Result shape (moved to CPU for printing): torch.Size([10000, 10000])
>>>
>>> # Optional: Check if GPU memory is being utilized
>>> print("Current GPU memory usage:")
Current GPU memory usage:
>>> print(f"Allocated: {torch.cuda.memory_allocated(device) / (1024 ** 2):.2f} MB")
Allocated: 1146.00 MB
>>> print(f"Cached: {torch.cuda.memory_reserved(device) / (1024 ** 2):.2f} MB")
Cached: 1146.00 MB