4 Batch Processing and Submitting Jobs
4.1 Learning Objectives
- Execute a script to run over a list of files on one system
- Batch process files to run on the HPC cluster
4.2 Using `for` loops to cycle through files
A very common pattern is cycling through multiple files in a folder and applying the same script or command to them.
There is a simple method for batch processing a bunch of files: a `for` loop.
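For example, we can apply `wc` to every `.qmd` file in the current folder. This is a minimal version of the loop, reconstructed to match the numbered annotations below:

```bash
for file in *.qmd  # 1
do                 # 2
  wc ${file}       # 3
done               # 4
```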
1. Start the `for` loop and cycle through all `.qmd` files
2. Start of the instructions
3. Count the words in each `.qmd` file using `wc`
4. End of the instructions
If we run this in our repository, we get something similar to this:
```
  22   92   634 01_assignment.qmd
 462 2417 15396 01_basics.qmd
   5    8    49 02_assignment.qmd
 303 1577  9509 02_scripting.qmd
 303 1667 10233 03_batch.qmd
 198 1368  9027 04_containers_workflows.qmd
 205 1151  7858 configuring.qmd
  53  363  2516 index.qmd
   9   25   157 intro.qmd
 214 1314  8001 miscellaneous.qmd
```
The `*` in `*.qmd` (the wildcard operator, also known as a glob) can be used in various ways. For example, if our files are in a folder called `raw_data/`, we could specify:
```bash
for file in raw_data/*.fq
do
  # align each FASTQ against a reference (ref.fa is a placeholder path)
  bwa mem ref.fa ${file} > ${file}.sam
done
```
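Globs can match more than just extensions. A few illustrative patterns (the file names here are hypothetical):

```bash
ls raw_data/sample?.fq     # ? matches exactly one character
ls raw_data/*_[12].fq      # [12] matches either "1" or "2"
ls raw_data/*.{fq,fastq}   # {fq,fastq} matches either extension (brace expansion)
```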
4.2.1 For more info on globs
See page 12 in *Bite Size Bash*.
4.2.2 Using file manifests
One approach that I use a lot is using file manifests to process multiple sets of files. Each line of the file manifest will contain all of the related files I need to process.
For example, if I am aligning paired-end reads, I can have a tab-separated file with one column for read 1 and another for read 2.
```
read1	read2
sample1_1.fq	sample1_2.fq
sample2_1.fq	sample2_2.fq
```
The one trick with using file manifests in bash is that we need to change what's called the internal field separator (IFS), which specifies how bash splits up a string in a `for` loop. By default, the IFS includes the space character, which means that the `for` loop will cycle through words (strings separated by spaces) instead of lines.
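To see the word-splitting behavior, here is a minimal sketch (`demo.txt` is a hypothetical two-line file):

```bash
printf 'a b\nc d\n' > demo.txt

# With the default IFS, the loop splits on spaces as well as newlines,
# so it sees four words, not two lines
for x in $(cat demo.txt)
do
  echo "[${x}]"
done
# prints: [a] [b] [c] [d]
```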
We can change this behavior by setting the IFS at the beginning of our script:
```bash
IFS=$'\n' # 1
for file in $(cat manifest.txt)
do
  bwa mem ${FASTA_LOCATION} ${file} > ${OUTPUT_FOLDER}/${file}.bam
  samtools sort ${OUTPUT_FOLDER}/${file}.bam -o ${OUTPUT_FOLDER}/${file}.sorted.bam
done
unset IFS # 2
```
1. Change the IFS to a newline, so the loop processes the manifest line by line.
2. Reset the IFS to its original behavior.
4.3 Batching on HPC
Now we can start to do more advanced things on the HPC: use one machine to process each file.
Let’s start out with scripts.
4.3.1 SLURM Scripts
SLURM scripts are a special kind of shell script that contain additional information for the SLURM manager. This includes:
- Number of nodes (machines) to request
- Memory and CPU requirements for each machine
We specify these using a special kind of comment: SLURM directives. Directives begin a line with `#SBATCH`:
```bash
#SBATCH --nodes=1
```
In this example, we are specifying the number of nodes.
4.3.2 SLURM Directives
We are able to set some configuration on running our jobs.
```bash
#!/bin/bash
#SBATCH --nodes=1         # 1
#SBATCH --array=1-3       # 2
#SBATCH --mem-per-cpu=1gb # 3
#SBATCH --time=00:05:00   # 4
./samtools_opt sort SRR1576820_000${SLURM_ARRAY_TASK_ID}.bam -o SRR1576820_000${SLURM_ARRAY_TASK_ID}.sorted.bam # 5
```
1. Request 1 node
2. Start a job array with task IDs 1-3
3. Request 1 gigabyte of memory per CPU
4. Ask for 5 minutes on the node
5. Run `samtools sort` on a `.bam` file and output the sorted file (this will run for each subjob in the job array)
4.3.3 Job Arrays
This line:

```bash
#SBATCH --array=1-6
```

will create a job array. It also creates a variable called `$SLURM_ARRAY_TASK_ID` that cycles through the numbers 1-6. Each task ID corresponds to a different subjob. Let's try a simpler script to show what's going on:
```bash
#| eval: false
#| filename: sbatch_test.sh
#!/bin/bash
#SBATCH --array=1-3
#SBATCH --nodes=1
echo "${SLURM_ARRAY_TASK_ID} job"
```
This is a minimal script that will execute 3 subjobs. It will cycle through the job array and print the array number for each job.
```bash
#| eval: false
sbatch sbatch_test.sh
```
On submitting, we will get a message like this (your job number will be different):
Submitted batch job 26328834
And if we look for the output files:
```bash
ls -l slurm-26328834*
```
We will get the following output:
```
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_1.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_2.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_3.out
```
Taking a look at one of these files using `cat`:

```bash
cat slurm-26328834_3.out
```
We’ll see this:
```
3 job
```
What happened here? `sbatch` submitted our job array as 3 different subjobs to 3 different nodes under a single job id. Each subjob then wrote an output file, named with the parent job id and its task id, containing its task number.
Here's a diagram of what happened:

```{mermaid}
graph TD
  A["sbatch sbatch_test.sh"] --"1"--> B["echo 1 job"]
  A --"2"--> C["echo 2 job"]
  A --"3"--> D["echo 3 job"]
```
4.3.4 Processing files using Job Arrays
So now we know that `${SLURM_ARRAY_TASK_ID}` lets us specify a subjob within our script. How do we use it to process files?
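One common pattern (a sketch, not the only way) is to combine the task ID with a file manifest: each subjob uses `${SLURM_ARRAY_TASK_ID}` to pick out its own line of the manifest. Here, `bam_manifest.txt` is a hypothetical file listing one `.bam` file per line:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --array=1-3
#SBATCH --time=00:05:00

# Grab the line of the manifest that matches this subjob's task ID
# (sed -n "3p" prints only line 3, and so on)
bam_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" bam_manifest.txt)

# Sort that file; each subjob processes a different line of the manifest
samtools sort ${bam_file} -o ${bam_file%.bam}.sorted.bam
```

With `--array=1-3`, subjob 1 sorts the file on line 1, subjob 2 the file on line 2, and so on, all running in parallel.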
4.3.5 `scancel`ing a job array
As we noted, one of the strengths of using a job array to process multiple files is that the subjobs are spawned as sub (or child) jobs of a parent job id.
What if we made a mistake? We can use the `scancel` command to cancel the entire set of jobs by giving it our parent job id:
```bash
scancel 26328834
```
This will cancel all sub jobs related to the parent job.
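We can also cancel a single subjob by appending its task id to the parent job id (using the job number from the earlier example):

```bash
# cancel only subjob 2 of the array, leaving the other subjobs running
scancel 26328834_2
```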