[Diagram: sbatch sbatch_test.sh submits a job array of three subjobs, each running echo to print its task number ("1 job", "2 job", "3 job").]
6 Batch Processing and Submitting Jobs
6.1 Exercises
6.2 Learning Objectives
- Execute a script to run over a list of files on one system
- Utilize globs to specify multiple files in your script
- Batch process files both locally and on the HPC cluster
6.3 Using for loops to cycle through files
A very common pattern is cycling through multiple files in a folder and applying the same script or command to them.
There is a simple method for batch processing a bunch of files: a for loop. In our case, a for loop takes a list of file paths (such as a list of FASTA files we want to process), and performs the same task for each element of the list.
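For example, here is a minimal sketch of such a loop over the two course files used below (the markers match the numbered notes that follow):

for file in 01_assignment.qmd 02_scripting.qmd   # <1>
do                                               # <2>
  wc $file                                       # <3>
done                                             # <4>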
1. Cycle through the list of 01_assignment.qmd and 02_scripting.qmd.
2. Start of instructions.
3. Count the words in each .qmd file using wc.
4. End of instructions.
The do and done sandwich the instructions we want to run on each file in our list. wc will print the line, word, and byte counts for these two files.
If we run this in the bash_for_bio/ folder, we’ll get the following:
26 96 656 01_assignment.qmd
493 2609 16206 02_scripting.qmd
6.4 globs: selecting multiple files
However, typing out each element of our list gets tedious. Can we select files in a different way?
We can use globs, or wildcard characters, to select multiple files that match a pattern.
For example, *.qmd will list all of the .qmd files in the bash_for_bio directory.
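The loop has the same shape as before, with the glob standing in for the explicit file list (a minimal sketch, with markers matching the numbered notes below):

for file in *.qmd   # <1>
do                  # <2>
  wc $file          # <3>
done                # <4>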
1. Start the for loop and cycle through all .qmd files.
2. Start of instructions.
3. Count the words in each .qmd file using wc.
4. End of instructions.
The * (the wildcard operator, also known as a glob) can be used in various ways. For example, if our files are in a folder called data/, we could specify:

for file in ./data/*.fastq
do
  wc $file
done

If we run this in our repository, we get something similar to this:

3220 3220 142485 ./data/CALU1_combined_final.fastq
2484 2484 109917 ./data/HCC4006_final.fastq
1836 1836 81243 ./data/MOLM13_combined_final.fastq
6.4.1 Try it out
Try running ./scripts/week3/batch_on_rhino.sh
6.4.2 For more info on globs
See page 12 in Bite Size Bash.
6.5 Batching on HPC
Now we can start to do more advanced things on the HPC cluster: using one machine to process each file. We will use a slightly different mechanism to cycle through files: the job array.
Let's start out with SLURM scripts.
6.5.1 SLURM Scripts
SLURM scripts are a special kind of shell script that contain additional information for the SLURM manager. This includes:
- Number of nodes (machines) to request
- Memory and CPU requirements for each machine
We specify these using a special kind of comment: SLURM directives. Directives begin a line with #SBATCH:
#SBATCH --nodes=1
In this example, we are specifying the number of nodes.
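Memory and CPU requests use the same directive syntax. For example (a sketch; --mem and --cpus-per-task are standard sbatch options, but the values here are placeholders rather than ones taken from this course's scripts):

#SBATCH --mem=4G            # request 4 GB of memory
#SBATCH --cpus-per-task=4   # request 4 CPUs for the task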
Note that because directives begin with a #, they are treated as comments by bash, but are read and used by SLURM.
6.5.2 SLURM Directives
We can set some configuration options for how our jobs run.
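For example, here is the pair of directives the numbered notes below describe (a sketch, reusing the values from sbatch_test.sh later in this section):

#SBATCH --nodes=1     # <1>
#SBATCH --array=1-3   # <2>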
1. Request 1 node.
2. Start a job array.
There is a ton of information that I’m not including about the gizmo cluster. For much more info on submitting jobs, please refer to https://sciwiki.fredhutch.org/scicomputing/compute_jobs/.
6.5.3 Job Arrays
This line:

#SBATCH --array=1-3

will create a job array. This creates a variable called $SLURM_ARRAY_TASK_ID that cycles through the numbers 1 to 3. Each task ID corresponds to a different subjob. Let's try a simpler script to show what's going on:
#| eval: false
#| filename: sbatch_test.sh
#!/bin/bash
#SBATCH --array=1-3
#SBATCH --nodes=1
echo "${SLURM_ARRAY_TASK_ID} job"This is a minimal script that will execute 3 subjobs. It will cycle through the job array and print the array number for each job.
#| eval: false
sbatch sbatch_test.sh

On submitting, we will get a message like this (your job number will be different):
Submitted batch job 26328834
This will run very quickly on the three nodes. And if we look for the output files:
ls -l slurm-26328834*

We will get the following output:
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_1.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_2.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_3.out
Taking a look at one of these files using cat:
cat slurm-26328834_3.out

We'll see this:
3 job
What happened here? sbatch submitted our job array as 3 different subjobs to 3 different nodes under a single job id. Each subjob then wrote an output file, named with its subjob id, that contains the number it echoed.
6.5.4 Try it out
Try running:

sbatch ./scripts/week3/sbatch_test.sh

and look at the resulting .out files.
6.5.5 Processing lists of files using Job Arrays
So now we know that ${SLURM_ARRAY_TASK_ID} identifies a subjob within our script, but how do we actually use it?
Say we have a list of 3 files in our data/ directory, which we can list using ../../data/*.fastq. We can use ${SLURM_ARRAY_TASK_ID} as an index to pick a different file in each subjob.
The one caveat is that we need to know the number of files beforehand.
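As a quick interactive check of the indexing (a sketch, assuming you are in scripts/week3/ so that the fastq files sit at ../../data/):

file_array=(../../data/*.fastq)   # the glob expands into a bash array
echo ${#file_array[@]}            # number of files, which tells us the --array range to use
echo ${file_array[0]}             # bash arrays start at index 0, so this is the first file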
This script will run our run_bwa.sh on 3 separate files on 3 separate nodes:
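Here is a sketch of what sbatch_bwa.sh could look like, following the numbered notes below ($file_array and $current_file are named in the notes; the helper $file_index and the exact way run_bwa.sh is invoked are assumptions for illustration):

#!/bin/bash
#SBATCH --nodes=1                          # one node per subjob, as in sbatch_test.sh
#SBATCH --array=1-3                        # <1>
file_array=(../../data/*.fastq)            # <2>
file_index=$(($SLURM_ARRAY_TASK_ID - 1))   # <3>
current_file=${file_array[$file_index]}    # <4>
./run_bwa.sh $current_file                 # <5>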
1. Initialize the job array (the range should cover the number of files; here, 1 to 3).
2. List all files in ../../data/ with the extension .fastq and assign them to the array $file_array.
3. For the current task id, calculate the appropriate index (we have to subtract 1 because bash arrays begin at 0).
4. Pull the current file path for that index and assign it to $current_file.
5. Run run_bwa.sh on $current_file.
We run this on rhino using this command (we are in ~/bash_for_bio/scripts/week3/):

sbatch sbatch_bwa.sh

We'll get the response:
Submitted batch job 35300989
If we take a look at the job queue using squeue, we’ll get something like this:
squeue -u tladera2

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
35300989_1 campus-ne sbatch_b tladera2 R 0:10 1 gizmok34
35300989_2 campus-ne sbatch_b tladera2 R 0:10 1 gizmok40
35300989_3 campus-ne sbatch_b tladera2 R 0:10 1 gizmok79
We can see our three subjobs, which are indicated by the Job IDs 35300989_1, 35300989_2, and 35300989_3.
tladera2@rhino02:~/bash_for_bio/scripts/week3$ ls -l
total 312
-rw-rw---- 1 tladera2 g_tladera2 211895 Sep 26 10:29 CALU1_combined_final.sam
-rw-rw---- 1 tladera2 g_tladera2 160489 Sep 26 10:29 HCC4006_final.sam
-rw-rw---- 1 tladera2 g_tladera2 121509 Sep 26 10:29 MOLM13_combined_final.sam
-rwxrwxrwx 1 tladera2 g_tladera2 590 Sep 26 09:45 run_bwa.sh
-rwxrwxrwx 1 tladera2 g_tladera2 256 Sep 26 10:28 run_sbatch.sh
-rw-rw---- 1 tladera2 g_tladera2 614 Sep 26 10:29 slurm-35300992_1.out
-rw-rw---- 1 tladera2 g_tladera2 579 Sep 26 10:29 slurm-35300992_2.out
-rw-rw---- 1 tladera2 g_tladera2 619 Sep 26 10:29 slurm-35300992_3.out
And you'll see we generated our SAM files for each sample! Neat. There are also the .out files from each subjob, which capture whatever each subjob printed to standard output.
You can see that we outputted our files to the scripts/week3/ directory. It’s part of your job in the exercises to adapt run_sbatch.sh and run_bwa.sh to output to a directory of your choosing.
The gizmo cluster is used by a lot of people at the Hutch, so it gets busy from time to time.
Don't worry, your jobs will eventually be processed if your requests are reasonable.
If you want to look at the jobs that a particular user is running, you can use the -u flag. Try squeue -u on one of the users in the queue, for example tladera2:
squeue -u tladera2

You will usually use squeue -u on your own username, so you can see the status of your jobs. This is an example from my previous career at OHSU (I didn't have time to generate one on gizmo):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3970834 very_long KIRC.bla wooma R 11-04:21:09 1 exanode-7-15
3970835 very_long OV.blast wooma R 11-04:20:32 1 exanode-7-5
Let’s take a look at the output. You can see some pretty useful info: the JOBID, what PARTITION the job is running under, and the STatus.
STatus is really important, because it can tell you whether your job is:
R (Running), PD (Pending), ST (Stopped), S (Suspended), CD (Completed).
R or PD status is what you want to see, because that means it’s in the queue and will be executed.
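If the queue is long, you can also filter by state; for example, to see only pending jobs (the -t/--states flag is a standard squeue option, and the username here is just the example from above):

squeue -u tladera2 -t PD   # show only this user's pending jobs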
For much more info, please check out https://sciwiki.fredhutch.org/scicomputing/compute_jobs/#job-priority and https://sciwiki.fredhutch.org/scicomputing/compute_jobs/#why-isnt-my-job-running.
6.5.6 scanceling a job array
As we noted, one of the strengths of using a job array to process multiple files is that the subjobs are spawned as children of a single parent job id.
What if we made a mistake? We can use the scancel command to cancel the entire set of jobs by giving it our parent job id:
scancel 26328834

This will cancel all subjobs related to the parent job. No fuss, no muss.
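If only one subjob went wrong, scancel also accepts an individual subjob id (a quick sketch reusing the job number from above):

scancel 26328834_2   # cancels only subjob 2 of the array, leaving the others running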
6.6 What’s Next?
Next week we will discuss using MiniWDL (a workflow manager) to process files through multi-step workflows.