Job Example

Jupyter

This setup allows a user to run Jupyter Lab on a SLURM-managed compute node, facilitating remote access via SSH tunneling. The SLURM script handles resource allocation, while the Jupyter script sets up the environment and launches the server, providing clear instructions for accessing the notebook from a local machine.

SLURM Job Script jupyter.slurm

    #!/bin/bash -l
    #SBATCH --job-name=jupyter       # create a short name for your job
    #SBATCH --nodes=1                # node count
    #SBATCH --ntasks=1               # total number of tasks across all nodes
    #SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
    #SBATCH --mem=10G                # total memory (RAM) per node
    #SBATCH --time=1:00:00           # total run time limit (HH:MM:SS)

    module load conda/3
    ./jupyter.sh

The script above runs a second script, jupyter.sh, which contains the commands to set up the Jupyter environment and launch Jupyter Lab:

    export PORT=8888

    echo "set up SSH port forwarding between the compute resource and your local computer (laptop/desktop)"
    echo "ssh -N -L $PORT:[HOST]:$PORT [userID]@hpc.c2b2.columbia.edu"
    echo "Access the Jupyter Notebook via your web browser on your local computer."
    echo "http://127.0.0.1:$PORT/"

    module load conda/3

    echo " "
    echo "============================================================================="
    echo "=== jupyterlab.sh Install and run JupyterLab locally"
    echo "+++ installing jupyter"
    python -m venv .jup
    . .jup/bin/activate
    python -m pip install pip --upgrade
    python -m pip install --upgrade jupyterlab
    python -m pip install --upgrade bash_kernel
    python -m bash_kernel.install
    python -m pip install --upgrade jupyterlab-spellchecker

    echo "+++ run jupyter"
    jupyter-lab --ip=0.0.0.0 --port=$PORT --no-browser
jupyter-lab --ip=0.0.0.0 --port=$PORT --no-browser

This command starts the Jupyter Lab server, making it accessible from any IP address (0.0.0.0) on the specified port. The --no-browser option prevents Jupyter from trying to open a browser on the compute node, which is typically not possible in HPC environments.

Access the Notebook

After submitting the job, you'll need to set up port forwarding to access the Jupyter Notebook server.

  1. Find the allocated node: Check the job's output file, or run squeue -u $USER, to see which node the Jupyter server is running on.

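  2. Create an SSH tunnel: On your local machine, run ssh -N -L $PORT:[HOST]:$PORT [userID]@hpc.c2b2.columbia.edu, replacing [HOST] with the node name from step 1, [userID] with your user ID, and $PORT with the port number.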

  3. Open your browser: In the web browser on your local machine, go to http://127.0.0.1:$PORT, replacing $PORT with the port number (8888 in the script above). You should be able to access your Jupyter Notebook server.
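The tunnel command has the same shape as the hint that jupyter.sh prints, with the placeholders filled in. As a small sketch, the Python helper below assembles it. The function names tunnel_command and notebook_url are hypothetical, and "node042" and "jdoe" are made-up example values, not real node or account names.

```python
import shlex


def tunnel_command(node, user, port=8888, login_host="hpc.c2b2.columbia.edu"):
    """Build the ssh port-forwarding command that jupyter.sh prints as a hint."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    forward = f"{port}:{node}:{port}"  # local port : compute node : remote port
    return shlex.join(["ssh", "-N", "-L", forward, f"{user}@{login_host}"])


def notebook_url(port=8888):
    """URL to open in the local browser once the tunnel is up."""
    return f"http://127.0.0.1:{port}/"


if __name__ == "__main__":
    # "node042" and "jdoe" are hypothetical example values.
    print(tunnel_command("node042", "jdoe"))
    print(notebook_url())
```

The port must match the PORT value exported in jupyter.sh; if that value is changed, pass the same number to both helpers.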

GPU JOBS

To add a GPU to your Slurm allocation:

CuPy

CuPy is a library that provides a NumPy-like interface but is designed to use NVIDIA GPUs for accelerated computing. It lets users perform operations on large arrays and matrices efficiently by using the parallel processing power of GPUs. You can roughly think of CuPy as NumPy for GPUs.

To install CuPy

In this example we will perform a Singular Value Decomposition (SVD) on a randomly generated matrix using CuPy, which leverages GPU acceleration.

Here is the breakdown of what each part does:

Imports:

perf_counter from time for high-resolution timing.

cupy for GPU array operations.

Matrix Generation:

N = 2000 sets the size of the matrix.

X = cp.random.randn(N, N, dtype=cp.float64) creates a 2000x2000 matrix with normally distributed random numbers.

Timing Execution:

The code runs the SVD operation 5 times to measure performance accurately.

cp.linalg.svd(X) computes the SVD of matrix X.

cp.cuda.Device(0).synchronize() ensures that all GPU operations are complete before timing stops.

Results:

The minimum execution time from the trials is printed.

The sum of the singular values (s) is calculated and displayed.

The CuPy version used is printed.
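A minimal sketch consistent with the breakdown above might look like the following. It is a reconstruction from the description, not necessarily the exact script used on the cluster. The function name run_benchmark, the guarded import, and the availability check are assumptions added so the file also loads on a machine without CuPy.

```python
from time import perf_counter

try:
    import cupy as cp
except ImportError:  # lets the file load on machines without CuPy
    cp = None

N = 2000     # the matrix is N x N
TRIALS = 5   # number of timed repetitions


def run_benchmark(n=N, trials=TRIALS):
    """Return (fastest SVD time in seconds, sum of singular values)."""
    X = cp.random.randn(n, n, dtype=cp.float64)
    times = []
    for _ in range(trials):
        t0 = perf_counter()
        u, s, vh = cp.linalg.svd(X)
        # GPU work is asynchronous: wait for it before stopping the clock.
        cp.cuda.Device(0).synchronize()
        times.append(perf_counter() - t0)
    return min(times), float(s.sum())


if __name__ == "__main__":
    # The availability check is an assumption, not part of the described code.
    if cp is None or not cp.cuda.is_available():
        print("CuPy or a CUDA device is not available; run this inside a GPU job.")
    else:
        best, total = run_benchmark()
        print("Execution time:", best)
        print("sum(s) =", total)
        print("CuPy version:", cp.__version__)
```

Taking the minimum over several trials reduces the influence of one-time costs, such as library initialization during the first call.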

Below is a sample Slurm script:

Submit the job:

You can track the job's progress using squeue -u $USER. After the job finishes, check the output using cat slurm-*.out. What happens if we double the value of N? Will the execution time double? There is also a CPU version of the code; let's give that a try.
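On the question of doubling N: a dense SVD of an N x N matrix takes on the order of N^3 floating-point operations, so a rough first guess is that doubling N multiplies the run time by about 8 rather than 2. Measured GPU times can deviate from this, for example when the smaller matrix does not keep the device fully busy. The sketch below only does that arithmetic. predicted_time and matrix_bytes are hypothetical helper names, and the 1.0 s reference time is a placeholder, not a measurement.

```python
def predicted_time(t_ref, n_ref, n_new, exponent=3.0):
    """Extrapolate a run time from size n_ref to n_new, assuming time ~ N**exponent."""
    return t_ref * (n_new / n_ref) ** exponent


def matrix_bytes(n, itemsize=8):
    """Memory for one n x n matrix; itemsize=8 corresponds to float64."""
    return n * n * itemsize


if __name__ == "__main__":
    t_ref = 1.0  # placeholder reference time in seconds for N = 2000 (not measured)
    for n in (2000, 4000):
        print(n, predicted_time(t_ref, 2000, n), matrix_bytes(n) / 1e6, "MB")
```

Memory grows more slowly than time: the float64 input matrix goes from 32 MB at N = 2000 to 128 MB at N = 4000. Comparing the prediction with the measured times from the job output is a quick way to see how closely the cubic estimate holds on a given GPU.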

PyTorch

PyTorch is an open-source machine learning library widely used for deep learning applications. It provides a flexible and dynamic framework for building neural networks, enabling efficient computation on both CPUs and GPUs.

To install PyTorch

In this example we will use PyTorch to perform a Singular Value Decomposition (SVD) on a randomly generated matrix, leveraging GPU acceleration.

Breakdown of the Code:

Imports:

perf_counter from the time module for precise timing.

torch for tensor operations and GPU support.

Matrix Setup:

N = 2000 defines the dimensions of the matrix.

cuda0 = torch.device('cuda:0') specifies that computations should occur on the first GPU.

x = torch.randn(N, N, dtype=torch.float64, device=cuda0) generates a 2000x2000 matrix filled with random numbers, stored on the GPU.

SVD Computation:

t0 = perf_counter() starts the timer.

u, s, v = torch.svd(x) computes the SVD of the matrix.

elapsed_time = perf_counter() - t0 calculates the total time taken for the operation.

Results:

The execution time is printed.

The sum of the singular values (s) is computed and transferred back to the CPU for display using .cpu().numpy().

The PyTorch version is printed.
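A minimal sketch consistent with the breakdown above might look like the following. It is reconstructed from the description, not copied from the original script. The function name run_svd, the guarded import, and the explicit torch.cuda.synchronize call are assumptions: CUDA work can be queued asynchronously, so the sketch waits for the device before stopping the timer.

```python
from time import perf_counter

try:
    import torch
except ImportError:  # lets the file load on machines without PyTorch
    torch = None

N = 2000  # the matrix is N x N


def run_svd(n=N, device="cuda:0"):
    """Return (elapsed seconds, sum of singular values as a NumPy scalar)."""
    dev = torch.device(device)
    x = torch.randn(n, n, dtype=torch.float64, device=dev)
    t0 = perf_counter()
    u, s, v = torch.svd(x)
    if dev.type == "cuda":
        # Assumption: wait for queued GPU work before reading the clock.
        torch.cuda.synchronize(dev)
    elapsed_time = perf_counter() - t0
    return elapsed_time, s.sum().cpu().numpy()


if __name__ == "__main__":
    if torch is None or not torch.cuda.is_available():
        print("PyTorch or a CUDA device is not available; run this inside a GPU job.")
    else:
        elapsed_time, total = run_svd()
        print("Execution time:", elapsed_time)
        print("sum(s) =", total)
        print("PyTorch version:", torch.__version__)
```

torch.svd matches the call named in the breakdown. Recent PyTorch releases mark it as deprecated in favor of torch.linalg.svd, which returns the conjugate transpose of V instead of V.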

Here is a sample Slurm script:

Submit the job:

You can track the job's progress using squeue -u $USER. After the job finishes, check the output using cat slurm-*.out

MATLAB

MATLAB is available on the cluster.

In this example we will use MATLAB to perform Singular Value Decomposition (SVD) on a randomly generated matrix, leveraging GPU acceleration.

Breakdown of the Code:

Initialize GPU Device:

gpu = gpuDevice(); retrieves the current GPU device. fprintf and disp display the name and details of the GPU being used.

Create a GPU Array:

X is created as a GPU array containing specified values. whos X; displays information about the variable X.

Perform SVD:

Computes the SVD of the matrix X, returning matrices U, S, and V.

Calculate and Display the Trace:

Calculates the trace of matrix S (the sum of its diagonal elements) and prints it.

Here is the Slurm script:

Submit the job:

You can track the job's progress using squeue -u $USER. After the job finishes, check the output using cat slurm-*.out

C++

This example is a C++ CUDA program that defines a simple kernel to add the elements of two arrays and demonstrates the use of Unified Memory.

Here's a breakdown of the code:

Includes:

Includes the necessary headers for input/output and mathematical functions.

Kernel Function:

This function runs on the GPU and performs element-wise addition of arrays x and y.

Note that this implementation is not efficient for parallel execution since it runs a loop on a single thread.

Main Function:

N is set to 2^20 (1,048,576, about 1 million elements).

Unified Memory is allocated for the arrays x and y, making them accessible from both the CPU and GPU.

Initialize Arrays:

The host initializes the arrays x and y with values 1.0 and 2.0, respectively.

Kernel Launch:

The kernel is launched with one block and one thread. The device is synchronized to ensure the GPU finishes before accessing the results.

Error Checking:

The maximum error is computed to verify that all elements of y have been correctly updated to 3.0.

Memory Cleanup:

Frees the allocated Unified Memory for x and y.

Output

The program will output the maximum error, which should be very close to 0.0 if the addition was successful.
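To see what result to expect without a GPU, the same arithmetic can be traced on the CPU. The Python sketch below is only a reference for the program's logic, not the CUDA code itself. The function names are hypothetical, and the sequential loop stands in for the single-thread kernel described above.

```python
N = 1 << 20  # 2**20 = 1,048,576 elements


def add(n, x, y):
    """Sequential element-wise y[i] = x[i] + y[i], like the one-thread kernel loop."""
    for i in range(n):
        y[i] = x[i] + y[i]


def max_error(y, expected=3.0):
    """Largest absolute deviation of any element of y from the expected value."""
    return max(abs(v - expected) for v in y)


if __name__ == "__main__":
    x = [1.0] * N
    y = [2.0] * N
    add(N, x, y)
    print("Max error:", max_error(y))
```

Because 1.0 + 2.0 is exactly representable in floating point, every element comes out as exactly 3.0 and the reported error is 0.0.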

Here is the Slurm script:

This Slurm script compiles the CUDA file matrix_add.cu with the NVIDIA CUDA Compiler (nvcc), producing an executable named matrix_add, and then runs it.

Submit the job:

R

R is already installed on the cluster. To use R, load the module:

In this example we will use a GPU to compute the means of a list of random vectors. To install the required gpuR package:

Here is the R code:

Breakdown of the code:

Load the Necessary Library:

library(gpuR) loads the gpuR package, which provides tools for performing GPU-accelerated computations in R. It allows users to utilize the processing power of GPUs for tasks like matrix operations, making computations faster compared to running on a CPU.

Create a List of Random Vectors:

This generates a list of 500 random vectors, each containing 500 normally distributed numbers.

rnorm(500) creates a vector of 500 random numbers from a standard normal distribution (mean = 0, standard deviation = 1).

replicate(500, ...) runs the rnorm(500) function 500 times, creating a list of 500 vectors. The simplify = FALSE argument ensures that the output is a list rather than a matrix.

Define the Mean Calculation Function Using GPU:

This function takes a vector vec as input and computes its mean using the GPU.

gpuVector(vec, type = "float"): Converts the input vector into a gpuVector, which is a data structure compatible with GPU operations. Specifying type = "float" indicates that the data type of the elements is floating-point.

mean(gpu_vec): Computes the mean of the GPU vector. This operation uses the GPU's parallel processing capabilities, which can be much faster than a CPU for large datasets; for vectors as short as the 500-element ones used here, the cost of moving data to the GPU may outweigh that gain.

as.numeric(mean_value): Converts the result back to a numeric value for easier handling in R.

Compute the Means Using GPU:

start_time <- Sys.time(): Records the current time before starting the computation. This is used to measure how long the computation takes.

means_gpu <- sapply(vectors, mean_vector_gpu): Applies the mean_vector_gpu function to each vector in the vectors list. The sapply function simplifies the output to a numeric vector containing the means.

end_time <- Sys.time(): Records the time after the computation has finished.

Print the Execution Time:

This calculates the total execution time by subtracting start_time from end_time and prints the result to the console. The output shows how long the GPU computation took.

Here is the Slurm script:

CPU JOBS

R

R is already installed on the cluster. To use R, load the module:

In this example we will use parallel processing to compute the means of a list of random vectors. To install the required packages:

Here is the R code:

Code Breakdown:

Load Libraries:

Loads the necessary libraries for parallel processing.

Generate Random Vectors:

Creates a list of 1000 random vectors, each containing 1000 normally distributed numbers.

Define the Mean Function:

Defines a function to compute the mean of a vector.

Parallel Mean Calculation:

Creates a cluster with 4 cores.

Registers the cluster for parallel processing.

Measures the execution time for computing the means in parallel using foreach.

Stops the cluster after the computation.

Serial Mean Calculation:

Measures the execution time for computing the means serially using a for loop.

Print Execution Times:

Outputs the execution times for both parallel and serial computations.

Here is the Slurm script:

This Slurm script allocates 4 CPU cores for the task, which matches the 4-core cluster created in the R script.
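Hard-coding the core count in two places is easy to get out of sync. One option is to read the allocation from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when --cpus-per-task is given. The sketch below shows the idea in Python rather than R, as an analogue of the workflow described above and not a translation of the R script. All function names are hypothetical, and the default of 4 workers outside a job is an assumption.

```python
import os
import random
import statistics
from concurrent.futures import ProcessPoolExecutor
from time import perf_counter


def allocated_cpus(default=4):
    """Cores granted via --cpus-per-task, or `default` when run outside a job."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


def make_vectors(count=1000, length=1000, seed=0):
    """List of `count` vectors, each with `length` standard-normal numbers."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(count)]


def serial_means(vectors):
    return [statistics.fmean(v) for v in vectors]


def parallel_means(vectors, workers):
    # One worker process per allocated core; chunks cut per-task overhead.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(statistics.fmean, vectors, chunksize=50))


if __name__ == "__main__":
    vectors = make_vectors()
    workers = allocated_cpus()

    t0 = perf_counter()
    par = parallel_means(vectors, workers)
    t_par = perf_counter() - t0

    t0 = perf_counter()
    ser = serial_means(vectors)
    t_ser = perf_counter() - t0

    print(f"parallel ({workers} workers): {t_par:.3f} s, serial: {t_ser:.3f} s")
```

For a task this small, the parallel run is not guaranteed to win: starting worker processes and copying the vectors to them can cost more than the means themselves.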

NumPy

In this example we will perform a Singular Value Decomposition (SVD) on a randomly generated matrix using NumPy.

Here’s a breakdown of the code:

Imports and Setup:

Import perf_counter for timing, and set up a matrix of size 2000x2000 filled with random values from a normal distribution.

SVD Computation:

Run the SVD computation five times, timing each run and storing the times in a list.

Results:

Finally, print the minimum execution time, the sum of the singular values, and the version of NumPy being used.
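A minimal sketch consistent with this breakdown might look like the following. It is reconstructed from the description rather than copied from the original script, and the names run_benchmark and report are hypothetical.

```python
from time import perf_counter

import numpy as np


def run_benchmark(n=2000, trials=5):
    """Return (fastest SVD time in seconds, sum of singular values)."""
    X = np.random.randn(n, n)  # normally distributed values, float64 by default
    times = []
    for _ in range(trials):
        t0 = perf_counter()
        u, s, vh = np.linalg.svd(X)
        times.append(perf_counter() - t0)
    return min(times), float(s.sum())


def report(best, total):
    """Print the three results listed in the breakdown."""
    print("Execution time:", best)
    print("sum(s) =", total)
    print("NumPy version:", np.__version__)
```

Adding the line report(*run_benchmark()) at the end of the file runs the full 2000x2000 benchmark. No device synchronization is needed here because NumPy returns only after the computation has finished.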

Here is the Slurm script: