4  Batch Processing and Submitting Jobs

4.1 Learning Objectives

  • Execute a script to run over a list of files on one system
  • Batch process files on the HPC cluster

4.2 Using for loops to cycle through files

A very common pattern is cycling through multiple files in a folder and applying the same script or command to them.

There is a simple method for batch processing a bunch of files: a for loop.

#!/bin/bash
for file in *.qmd     # start the for loop and cycle through all .qmd files
do                    # start of the loop body
  wc $file            # count the words in each .qmd file using wc
done                  # end of the loop body
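If we save this script (say, as count_words.sh, a hypothetical name) and make it executable, we can run it from inside the repository folder:

chmod +x count_words.sh
./count_words.sh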

If we run this in our repository, we get something similar to this:

      22      92     634 01_assignment.qmd
     462    2417   15396 01_basics.qmd
       5       8      49 02_assignment.qmd
     303    1577    9509 02_scripting.qmd
     303    1667   10233 03_batch.qmd
     198    1368    9027 04_containers_workflows.qmd
     205    1151    7858 configuring.qmd
      53     363    2516 index.qmd
       9      25     157 intro.qmd
     214    1314    8001 miscellaneous.qmd

The * in *.qmd (the wildcard operator, also known as a glob) can be used in various ways. For example, if our files are in a folder called raw_data/, we could specify:

for file in raw_data/*.fq
do
  bwa mem ${file}     # bwa mem arguments abbreviated here; see the full example below
done
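Globs can also match more specific patterns. A few examples (the file names here are made up for illustration):

raw_data/*_R1.fq         # only files ending in _R1.fq
raw_data/sample?.fq      # ? matches exactly one character: sample1.fq, sample2.fq, ...
raw_data/sample[1-3].fq  # a character range: sample1.fq, sample2.fq, or sample3.fq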
A common pattern: a folder with only one type of file in it

One thing that makes it easier to process a bunch of files is to keep the data in a single folder, with nothing else in it.

I might create a fastq/ folder where I store my data, so I can pass the wildcard fastq/* to process all of the files in that folder.

#!/bin/bash
module load BWA
module load SAMtools
FASTA_LOCATION=""     # path to your reference FASTA (fill this in)
OUTPUT_FOLDER="/hpc/temp/my_lab/project_x/bam_files/"
for file in fastq/*
do
  base=$(basename ${file})   # strip the fastq/ prefix so outputs land directly in OUTPUT_FOLDER
  bwa mem ${FASTA_LOCATION} ${file} > ${OUTPUT_FOLDER}/${base}.bam
  samtools sort ${OUTPUT_FOLDER}/${base}.bam -o ${OUTPUT_FOLDER}/${base}.sorted.bam
done
module purge
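The basename call above is one way to strip the leading folder from a path; bash parameter expansion can do the same job, and can also strip extensions. A quick sketch:

file="fastq/sample1.fq"
echo $(basename ${file})   # sample1.fq
echo ${file##*/}           # sample1.fq  (remove everything up to the last /)
echo ${file%.fq}           # fastq/sample1  (remove the .fq extension)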

4.2.1 For more info on globs

See page 12 in Bite Size Bash.

Selecting files with complicated patterns: Regular Expressions

4.2.2 Using file manifests

One approach that I use a lot is using file manifests to process multiple sets of files. Each line of the file manifest will contain all of the related files I need to process.

For example, if I am aligning paired-end reads, I can have a tab-separated column for the first read in each pair, and a column for the second.

read1          read2
sample1_1.fq   sample1_2.fq
sample2_1.fq   sample2_2.fq

The one trick with using file manifests in bash is that we need to change what's called the internal field separator (IFS), which controls how bash splits up a string in a for loop. By default, bash splits on spaces, tabs, and newlines, which means that the for loop will cycle through individual words instead of whole lines.

We can change this behavior by setting the IFS at the beginning of our script:

IFS=$'\n'     # change IFS to a newline so the manifest is split line by line, not word by word
for file in $(cat manifest.txt)
do
  # with a newline IFS, each ${file} is one whole line of the manifest
  bwa mem ${FASTA_LOCATION} ${file} > ${OUTPUT_FOLDER}/${file}.bam
  samtools sort ${OUTPUT_FOLDER}/${file}.bam -o ${OUTPUT_FOLDER}/${file}.sorted.bam
done
unset IFS     # reset IFS to its original behavior
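Since each line of this manifest has two columns, a while loop with read is another way to handle it: read can split each line into named variables. Here is a sketch, reusing the manifest.txt, FASTA_LOCATION, and OUTPUT_FOLDER from above (tail -n +2 skips the header row):

#!/bin/bash
tail -n +2 manifest.txt | while IFS=$'\t' read -r read1 read2
do
  # align the pair of files named on this line of the manifest
  bwa mem ${FASTA_LOCATION} ${read1} ${read2} > ${OUTPUT_FOLDER}/$(basename ${read1}).bam
  samtools sort ${OUTPUT_FOLDER}/$(basename ${read1}).bam -o ${OUTPUT_FOLDER}/$(basename ${read1}).sorted.bam
done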

4.3 Batching on HPC

Now we can start to do more advanced things on the HPC: using a separate machine to process each file.

Let’s start out with scripts.

4.3.1 SLURM Scripts

SLURM scripts are a special kind of shell script that contain additional information for the SLURM manager. This includes:

  1. Number of nodes (machines) to request
  2. Memory and CPU requirements for each machine

We specify these using a special kind of comment: SLURM directives. Directives are lines that begin with #SBATCH:

#SBATCH --nodes=1 

In this example, we are specifying the number of nodes.

4.3.2 SLURM Directives

We can configure how our jobs run using a handful of directives:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --array=1-3
#SBATCH --mem-per-cpu=1gb
#SBATCH --time=00:05:00
./samtools_opt sort SRR1576820_000${SLURM_ARRAY_TASK_ID}.bam -o SRR1576820_000${SLURM_ARRAY_TASK_ID}.sorted.bam

Line by line:

1. --nodes=1: request 1 node
2. --array=1-3: start a job array with task IDs 1 through 3
3. --mem-per-cpu=1gb: request 1 gigabyte of memory per CPU
4. --time=00:05:00: ask for 5 minutes on the node
5. Run samtools sort on a BAM file and output the sorted file (this runs once for each task in the job array)
More about directives

Much more information about the kinds of directives that you can specify in a SLURM script is available here: https://www.osc.edu/supercomputing/batch-processing-at-osc/slurm_directives_summary

The most important directive for batch processing is --array, which we'll cover next.
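A few other directives come up often. This is a sketch, not an exhaustive list, and the values shown (job name, output file, CPU count, partition) are illustrative; options like partition names are site-specific, so check your cluster's documentation:

#SBATCH --job-name=align_fastq    # a human-readable name shown in the queue
#SBATCH --output=align_%j.out     # where to write stdout (%j is the job id)
#SBATCH --cpus-per-task=4         # number of CPUs for each task
#SBATCH --partition=campus-new    # which partition (queue) to submit to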

4.3.3 Job Arrays

This line:

#SBATCH --array=1-6 

Will create a job array. SLURM sets a variable called $SLURM_ARRAY_TASK_ID that cycles through the numbers 1-6; each task ID corresponds to a different subjob. Let's try a simpler script, sbatch_test.sh, to show what's going on:

#!/bin/bash
#SBATCH --array=1-3
#SBATCH --nodes=1
echo "${SLURM_ARRAY_TASK_ID} job"

This is a minimal script that will execute 3 subjobs. It will cycle through the job array and print the array number for each job.

sbatch sbatch_test.sh

On submitting, we will get a message like this (your job number will be different):

Submitted batch job 26328834

And if we look for the output files:

ls -l slurm-26328834*

We will get the following output:

-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_1.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_2.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_3.out

Taking a look at one of these files using cat:

cat slurm-26328834_3.out

We’ll see this:

3 job

We can sketch what happened as a diagram:

graph TD
  A["sbatch sbatch_test.sh"] --"1"--> B["echo 1 job"]
  A --"2"--> C["echo 2 job"]
  A --"3"--> D["echo 3 job"]

What happened here? sbatch submitted our job array as 3 subjobs, potentially on 3 different nodes, under a single job id. Each subjob then wrote an output file, named with its subjob id, containing the output of its echo command.
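While the subjobs are running, we can check on them with squeue, giving it the parent job id (the id below is the one from the example above):

squeue -j 26328834

Each subjob should show up as 26328834_1, 26328834_2, and so on, though the exact output format varies by site.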

4.3.4 Processing files using Job Arrays

Now that we know ${SLURM_ARRAY_TASK_ID} identifies each subjob, how do we actually use it to process our files?
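One common approach, sketched here rather than prescribed, is to combine the task ID with a file manifest like the one we built earlier: each subjob pulls out the line of the manifest that matches its ${SLURM_ARRAY_TASK_ID} and processes just those files. The manifest.txt, FASTA_LOCATION, and OUTPUT_FOLDER names are carried over from the earlier examples, and --array=1-2 matches the two samples in that manifest:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --array=1-2
module load BWA
module load SAMtools
FASTA_LOCATION=""     # path to your reference FASTA
OUTPUT_FOLDER="/hpc/temp/my_lab/project_x/bam_files/"
# grab this subjob's line from the manifest (+1 skips the header row)
LINE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" manifest.txt)
# split the tab-separated line into the two read files
IFS=$'\t' read -r read1 read2 <<< "${LINE}"
bwa mem ${FASTA_LOCATION} ${read1} ${read2} > ${OUTPUT_FOLDER}/$(basename ${read1}).bam
samtools sort ${OUTPUT_FOLDER}/$(basename ${read1}).bam -o ${OUTPUT_FOLDER}/$(basename ${read1}).sorted.bam
module purge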

4.3.5 scanceling a job array

As we noted, one of the strengths of using a job array to process multiple files is that the subjobs are spawned as children of a single parent job id.

What if we made a mistake? We can use the scancel command to cancel the entire set of jobs by giving it our parent job id:

scancel 26328834

This will cancel all subjobs related to the parent job.
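If only one subjob needs to be cancelled, scancel also accepts the <jobid>_<task id> form. For example, to cancel just the second subjob of the array above:

scancel 26328834_2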