We have made a simple interactive tutorial to demonstrate the different types of array batch jobs available.
To start the tutorial, log in to the submit node and run cp-demo.sh to copy the sample files, as shown below.
$ ssh axon.rc.zi.columbia.edu
[aa3301@axon ~]$ cp-demo.sh
Copying Slurm Tutorial samples to slurm-tutorial1 in your home directory
[aa3301@axon ~]$ cd slurm-tutorial1/
[aa3301@axon slurm-tutorial1]$ ls -l
total 10
-rwxr-xr-x 1 aa3301 domain users 824 Sep 12 13:19 array_job.sh
-rwxr-xr-x 1 aa3301 domain users 507 Sep 12 13:19 bad_array_job.sh
-rw-r--r-- 1 aa3301 domain users 464 Sep 12 13:19 hello-world.sh
-rwxr-xr-x 1 aa3301 domain users  76 Sep 12 13:19 jobstep.slurm
-rw-r--r-- 1 aa3301 domain users 233 Sep 12 13:19 my-jobste-array.slurm
Let's take a look at some of the job files in the directory starting with array_job.sh.
#!/bin/bash
#SBATCH --job-name=array_job_test    # Job name
#SBATCH --mail-type=FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=myemail          # Where to send mail (e.g. uni123@columbia.edu)
#SBATCH --ntasks=1                   # Run a single task
#SBATCH --mem=1gb                    # Job Memory
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.log     # Standard output and error log
#SBATCH --array=1-5                  # Array range

echo "There are $(env | grep -c SLURM) slurm environmental variables set."
env | grep SLURM | sort

if [ -z $SLURM_JOB_ID ]
then
    echo "You're not running in a slurm job."
    exit
else
    FILE=/usr/share/dict/american-english
    WORD=$(sed -n ${SLURM_JOB_ID}p $FILE)
    echo
    echo "$WORD is word number $SLURM_JOB_ID in $FILE"
fi
This script shows how the SLURM environmental variables change between the jobs in a single array and how they can be leveraged so that each job does different work.
Another item of note about this script is that we have made it executable (e.g. chmod +x array_job.sh), so if we run it directly from the shell it functions as a normal shell script, ignoring the SBATCH directives since the shell treats them as comments.
[aa3301@axon slurm-tutorial1]$ ./array_job.sh
There are 0 slurm environmental variables set.
You're not running in a slurm job.
In this case the SBATCH directives are ignored, no SLURM environmental variables are detected, and the script uses that fact to behave differently in this environment.
Now we can try running the same script through Slurm and see how it behaves.
[aa3301@axon slurm-tutorial1]$ sbatch array_job.sh
Submitted batch job 2759
[aa3301@axon slurm-tutorial1]$ cat array_2759-1.log
There are 40 slurm environmental variables set.
SLURM_ARRAY_JOB_ID=2759
SLURM_ARRAY_TASK_COUNT=5
SLURM_ARRAY_TASK_ID=1
SLURM_ARRAY_TASK_MAX=5
SLURM_ARRAY_TASK_MIN=1
SLURM_ARRAY_TASK_STEP=1
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_CLUSTER_NAME=axon
SLURM_CPUS_ON_NODE=1
SLURMD_NODENAME=ax04
SLURM_GTIDS=0
SLURM_JOB_ACCOUNT=zrc
SLURM_JOB_CPUS_PER_NODE=1
SLURM_JOB_GID=413600513
SLURM_JOB_ID=2760
SLURM_JOBID=2760
SLURM_JOB_NAME=array_job_test
SLURM_JOB_NODELIST=ax04
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=burst
SLURM_JOB_QOS=normal
SLURM_JOB_UID=413601236
SLURM_JOB_USER=aa3301
SLURM_LOCALID=0
SLURM_MEM_PER_NODE=1024
SLURM_NNODES=1
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=ax04
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/axsys/home/aa3301/slurm-tutorial1
SLURM_SUBMIT_HOST=axon.rc.zi.columbia.edu
SLURM_TASK_PID=29726
SLURM_TASKS_PER_NODE=1
SLURM_TOPOLOGY_ADDR=ax04
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_WORKING_CLUSTER=axon:slurm:6817:8448

Achorn is word number 2760 in /usr/share/dict/words
Running the job in batch mode populates the SLURM environmental variables, which provide a fair amount of information about the running job, including the directory it was submitted from and other characteristics of the environment. We put in some conditional logic that uses the SLURM_JOB_ID variable to read a specific line from a file. Typically it makes more sense to use one of the task IDs, such as SLURM_ARRAY_TASK_ID, but this gives us a little variety.
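As a minimal sketch of that more typical approach, the fragment below selects the line using SLURM_ARRAY_TASK_ID instead of the job ID. It is illustrative only and is not one of the tutorial files:

# Illustrative fragment, not part of array_job.sh: use the array task ID
# (1-5 for this job) rather than the job ID to select a line from the word list.
FILE=/usr/share/dict/words
WORD=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$FILE")
echo "$WORD is word number $SLURM_ARRAY_TASK_ID in $FILE"

With this version task 1 always reads line 1 and task 5 always reads line 5, regardless of which job IDs Slurm happens to assign.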
As opposed to the hello-world.sh script from before, we are now modifying the log file names via an SBATCH directive: #SBATCH --output=array_%A-%a.log, where %A expands to the array's master job ID and %a to the array task index. Let's compare the output of two of the logs.
[aa3301@axon slurm-tutorial1]$ sdiff array_2759-1.log array_2759-2.log -s
SLURM_ARRAY_TASK_ID=1                                | SLURM_ARRAY_TASK_ID=2
SLURM_JOB_ID=2760                                    | SLURM_JOB_ID=2761
SLURM_JOBID=2760                                     | SLURM_JOBID=2761
SLURM_TASK_PID=29726                                 | SLURM_TASK_PID=29727
Achorn is word number 2760 in /usr/share/dict/words  | Achras is word number 2761 in /usr/share/dict/words
Comparing the output you can see the differences in the SLURM environment between the two tasks. While 2759 was the job ID assigned to the array as a whole, the individual tasks actually ran as successive job IDs (2760, 2761, and so on). Again, SLURM_ARRAY_TASK_ID is probably the best variable to work with, but feel free to use whichever you see fit.
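A common variation on the same idea is to have each task pick an input file instead of a line of text. The sketch below is hypothetical: it assumes a directory inputs/ containing files named sample_1.dat through sample_5.dat, which are not part of the tutorial download.

#!/bin/bash
#SBATCH --job-name=per_file_array        # hypothetical example, not in slurm-tutorial1
#SBATCH --output=per_file_%A-%a.log
#SBATCH --array=1-5

# Each task works on the input file matching its task ID,
# e.g. task 3 processes inputs/sample_3.dat.
INPUT=inputs/sample_${SLURM_ARRAY_TASK_ID}.dat
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"
wc -l "$INPUT"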
The next example job, bad_array_job.sh, is a simple script that shows how Slurm interprets program crashes and job failures.
#!/bin/bash
#SBATCH --job-name=bad_array_test       # Job name
#SBATCH --output=bad_array_%A-%a.log    # Standard output and error log
#SBATCH --array=1-3                     # Array range

echo "This isn't going to turn out good!"
echo SLURM_ARRAY_TASK_ID=$SLURM_ARRAY_TASK_ID

let number=$SLURM_ARRAY_TASK_ID-2
let result=1000/$number
echo $result
In this job we subtract 2 from the SLURM_ARRAY_TASK_ID variable and then divide by the result, so for the task where SLURM_ARRAY_TASK_ID equals 2 the divisor is 0 and the division fails. Let's run the job and cat the logs.
[aa3301@axon slurm-tutorial1]$ sbatch bad_array_job.sh
Submitted batch job 2764
[aa3301@axon slurm-tutorial1]$ cat bad_array_*.log
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=1
-1000
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=2
/var/spool/slurmd/job02766/slurm_script: line 12: let: result=1000/0: division by 0 (error token is "0")
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=3
1000
So all 3 tasks ran even though the middle task crashed. Let's take a look at how Slurm thought the job did.
[aa3301@axon slurm-tutorial1]$ sacct -j 2764
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2764_3       bad_array+      burst        zrc          1  COMPLETED      0:0
2764_3.batch      batch                   zrc          1  COMPLETED      0:0
2764_1       bad_array+      burst        zrc          1  COMPLETED      0:0
2764_1.batch      batch                   zrc          1  COMPLETED      0:0
2764_2       bad_array+      burst        zrc          1  COMPLETED      0:0
2764_2.batch      batch                   zrc          1  COMPLETED      0:0
As you can see in the output above from Slurm's sacct command, the job and all of its tasks registered as COMPLETED. Two other items to note here: the tasks are not listed in the order they ran, and the numbering in this report (2764_1, 2764_2, ...) is different from the individual job IDs the tasks actually ran under.
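If the default layout is more than you need, sacct can also be told which columns to print; for example, the following trims the report down to the state and exit code of each task:

sacct -j 2764 --format=JobID,JobName,State,ExitCode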
Slurm reports a job as failed or completed based on the exit status of the batch script, not on every command inside it. In this case the division-by-zero in task 2 did not register as a failure because the script's final command (the echo) still exited successfully, so the script as a whole handed an exit code of 0 back to Slurm even though one of its commands had failed along the way.
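If you want an error like this to surface in sacct, the script has to return a nonzero exit status itself. One possible modification, shown here as a sketch rather than as part of the tutorial's bad_array_job.sh, is to check whether the division succeeded and exit explicitly when it did not:

# Sketch of a modified ending for the script: propagate the arithmetic
# failure so the batch script itself exits nonzero.
let number=$SLURM_ARRAY_TASK_ID-2
if ! let result=1000/$number
then
    echo "Division failed for task $SLURM_ARRAY_TASK_ID" >&2
    exit 1    # a nonzero exit status is what makes Slurm mark the task FAILED
fi
# (note: let also returns nonzero when the result evaluates to 0,
#  which cannot happen with these particular values)
echo $result

With a change along these lines the division-by-zero task would show up as FAILED with exit code 1 instead of COMPLETED.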
Another way to see this is to break the script itself: if we append a nonexistent command to the bottom of the script, the final command now fails and we get a different result.
[aa3301@axon slurm-tutorial1]$ echo bad-command >> bad_array_job.sh
[aa3301@axon slurm-tutorial1]$ sbatch bad_array_job.sh
Submitted batch job 2770
[aa3301@axon slurm-tutorial1]$ cat bad_array_2770-*.log
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=1
-1000
/var/spool/slurmd/job02771/slurm_script: line 14: bad-command: command not found
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=2
/var/spool/slurmd/job02772/slurm_script: line 12: let: result=1000/0: division by 0 (error token is "0")
/var/spool/slurmd/job02772/slurm_script: line 14: bad-command: command not found
This isn't going to turn out good!
SLURM_ARRAY_TASK_ID=3
1000
/var/spool/slurmd/job02770/slurm_script: line 14: bad-command: command not found
[aa3301@axon slurm-tutorial1]$ sacct -j 2770
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2770_3       bad_array+      burst        zrc          1     FAILED    127:0
2770_3.batch      batch                   zrc          1     FAILED    127:0
2770_1       bad_array+      burst        zrc          1     FAILED    127:0
2770_1.batch      batch                   zrc          1     FAILED    127:0
2770_2       bad_array+      burst        zrc          1     FAILED    127:0
2770_2.batch      batch                   zrc          1     FAILED    127:0
Now the job shows as FAILED, with an accompanying exit code (127, "command not found"). Even in this case, though, the subsequent tasks continued executing even though the first one failed.
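If only some of the tasks in an array fail, there is no need to rerun the ones that already succeeded: sbatch options given on the command line override the #SBATCH directives in the script, so after fixing the underlying problem you can resubmit just the tasks you need. For example, rerunning only task 2 would look like:

sbatch --array=2 bad_array_job.sh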