6  Batch Processing and Submitting Jobs

6.1 Exercises

Open up the exercises here.

6.2 Learning Objectives

  • Execute a script to run over a list of files on one system
  • Utilize globs to specify multiple files in your script
  • Batch process files both locally and on the HPC cluster

6.3 Using for loops to cycle through files

A very common pattern is cycling through multiple files in a folder and applying the same script or command to them.

There is a simple method for batch processing a bunch of files: a for loop. In our case, a for loop takes a list of file paths (such as a list of FASTA files we want to process), and performs the same task for each element of the list.

for file in 01_assignment.qmd 02_scripting.qmd
do
  wc $file
done

Going through the loop line by line:

1. Cycle through the list of 01_assignment.qmd and 02_scripting.qmd.
2. Start of instructions.
3. Count the words in each .qmd file using wc.
4. End of instructions.

The do and done sandwich the instructions we want to apply to each file in our list. wc will produce a word count of these two files.

If we run this in the bash_for_bio/ folder, we’ll get the following:

      26      96     656 01_assignment.qmd
     493    2609   16206 02_scripting.qmd
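One small robustness note: if a filename contains spaces, an unquoted $file gets split into multiple words before wc sees it. Quoting the variable avoids this; here is a minimal variation of the loop above:

for file in 01_assignment.qmd 02_scripting.qmd
do
  wc "$file"   # the quotes keep each filename intact, even if it contains spaces
done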

6.4 globs: selecting multiple files

However, typing each element of our list is going to be difficult - can we select files in a different way?

We can use globs, or wildcard characters, to select multiple files that match a pattern.

For example, *.qmd will list all of the .qmd files in the bash_for_bio directory.
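We can preview what a glob expands to before putting it in a loop. For example, assuming we are sitting in the bash_for_bio directory:

ls *.qmd   # lists every file ending in .qmd in the current directory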

#| filename: ./scripts/week3/batch_on_rhino.sh
#!/bin/bash
for file in ./data/*.fastq
do
  wc $file
done

Going through it line by line:

1. Start the for loop and cycle through all .fastq files in ./data/.
2. Start of instructions.
3. Count the words in each .fastq file using wc.
4. End of instructions.

If we run this in our repository, we get something similar to this:

3220   3220 142485 ./data/CALU1_combined_final.fastq
2484   2484 109917 ./data/HCC4006_final.fastq
1836   1836 81243 ./data/MOLM13_combined_final.fastq

The * wildcard operator (also known as a glob) can be used in various ways. For example, if our files are in a folder called data/, we could specify:

for file in ./data/*.fastq
do
  wc $file 
done
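A few other wildcard patterns are also worth knowing. The filenames below are hypothetical; adjust the patterns to your own data:

wc ./data/*_final.fastq       # * matches any run of characters
wc ./data/sample_?.fastq      # ? matches exactly one character (sample_1.fastq, sample_2.fastq, ...)
wc ./data/sample_[12].fastq   # [12] matches one character from the set: 1 or 2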

6.4.1 Try it out

./scripts/week3/batch_on_rhino.sh
Note: A common pattern is a folder with only one type of file in it

One thing that makes it easier to process a bunch of files is to have the data in a single folder, with nothing else in it.

For example, I might create a fastq/ folder where I store my data, so I can pass the glob fastq/* to process the files in that folder.

#!/bin/bash
module load BWA
module load SAMtools
FASTA_LOCATION=""   # path to the reference FASTA (fill this in before running)
OUTPUT_FOLDER="/home/tladera2/project_x/bam_files/"
for file in fastq/*
do
  bwa mem ${FASTA_LOCATION} $file > ${OUTPUT_FOLDER}${file}.bam
done
module purge
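One thing to watch in the script above: $file includes the fastq/ prefix, so the output is written into a fastq/ subfolder of $OUTPUT_FOLDER, which would need to exist. A common refinement, sketched here under the same assumptions (the reference location still needs to be filled in), is to strip the directory and extension with basename when building the output name. Note that bwa mem writes SAM to standard output:

for file in fastq/*
do
  base=$(basename "$file" .fastq)                                  # fastq/sample1.fastq -> sample1
  bwa mem ${FASTA_LOCATION} "$file" > ${OUTPUT_FOLDER}${base}.sam  # one SAM file per input FASTQ
done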

6.4.2 For more info on globs

See page 12 in Bite Size Bash.

Note: Selecting files with complicated patterns: Regular Expressions

At some point, globs are going to be inadequate depending on how you store files. At that point, you will probably need to learn regular expressions, which is a much more powerful way of describing search patterns.

I will be honest and say this is one thing LLMs are very good at. But you should have the vocabulary to prompt them, including:

  1. Literal Characters
  2. Metacharacters (including how to escape special characters)
  3. Character classes (specifying groups of characters to match)
  4. Capture Groups, including named capture groups
  5. Quantifiers
  6. Logical expressions
  7. Anchors (matching positions)

Knowing these general concepts will help you with your LLM prompts. And always test your regular expressions to make sure that they are capturing the file patterns you expect.
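As a quick illustration, here is a regular expression that would match the three FASTQ files we saw earlier (CALU1_combined_final.fastq, HCC4006_final.fastq, and MOLM13_combined_final.fastq). The exact pattern is just an example, but it uses several of the concepts above:

ls ./data | grep -E '^[A-Za-z0-9]+(_combined)?_final\.fastq$'
# ^ and $ are anchors, [A-Za-z0-9]+ is a character class with a quantifier,
# (_combined)? is an optional group, and \. escapes the literal dot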

6.5 Batching on HPC

Now we can start to do more advanced things on the HPC cluster: use one machine to process each file. We will use a slightly different mechanism to cycle through files: the SLURM job array.

Let’s start out with scripts.

6.5.1 SLURM Scripts

SLURM scripts are a special kind of shell script that contain additional information for the SLURM manager. This includes:

  1. Number of nodes (machines) to request
  2. Memory and CPU requirements for each machine

We specify these using a special kind of comment: SLURM directives. Directives are lines that begin with #SBATCH:

#SBATCH --nodes=1 

In this example, we are specifying the number of nodes.

Note that because directives begin with a #, they are treated as comments by bash, but are read by SLURM.

6.5.2 SLURM Directives

We are able to set some configuration on running our jobs.

#| eval: false
#| filename: scripts/week3/sbatch_test.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --array=1-3
echo "${SLURM_ARRAY_TASK_ID} job"

Going through the directives:

1. Request 1 node.
2. Start a job array with task IDs 1 through 3.
Note: More about directives

Much more information about the kinds of directives that you can specify in a SLURM script is available here: https://www.osc.edu/supercomputing/batch-processing-at-osc/slurm_directives_summary

The most important directives to be aware of are how many nodes you need to request and how much memory your jobs will need. gizmo does not use the memory parameter; instead, use the number of cores as a proxy for memory. Much more information is here: https://sciwiki.fredhutch.org/scicomputing/compute_jobs/#memory
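For example, a request along these lines (the numbers are illustrative; the per-core memory on gizmo is documented on the SciWiki page above) asks for more cores on a single node as a way of getting more memory for each task:

#SBATCH --nodes=1
#SBATCH --cpus-per-task=4   # on gizmo, requesting more cores is how you get more memory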

Note: All About Jobs

There is a ton of information that I’m not including about the gizmo cluster. For much more info on submitting jobs, please refer to https://sciwiki.fredhutch.org/scicomputing/compute_jobs/.

6.5.3 Job Arrays

This line:

#SBATCH --array=1-3 

Will create a job array. SLURM sets a variable called $SLURM_ARRAY_TASK_ID that cycles through the numbers 1-3, so each task ID corresponds to a different subjob. Let’s try a simpler script to show what’s going on:

#| eval: false
#| filename: sbatch_test.sh
#!/bin/bash
#SBATCH --array=1-3
#SBATCH --nodes=1
echo "${SLURM_ARRAY_TASK_ID} job"

This is a minimal script that will execute 3 subjobs. It will cycle through the job array and print the array number for each job.

#| eval: false
sbatch sbatch_test.sh

On submitting, we will get a message like this (your job number will be different):

Submitted batch job 26328834

This will run very quickly on the three nodes. And if we look for the output files:

ls -l slurm-26328834*

We will get the following output:

-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_1.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_2.out
-rw-rw---- 1 tladera2 g_tladera2 8 Jul 15 13:50 slurm-26328834_3.out

Taking a look at one of these files using cat:

cat slurm-26328834_3.out

We’ll see this:

3 job

graph TD
  A["sbatch sbatch_test.sh"] --"1"--> B
  B["echo 1 job"]
  A --"2"--> C["echo 2 job"]
  A --"3"--> D["echo 3 job"]

What happened here? sbatch submitted our job array as 3 different subjobs to 3 different nodes under a single job id. Each node then outputs a file with the subjob id that contains the job number.
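By default the log files are named slurm-<jobid>_<taskid>.out. If you want them to go somewhere else, the --output directive takes a filename pattern, where %A is the parent job ID and %a is the array task ID. For example:

#SBATCH --output=logs/%A_%a.out   # e.g. logs/26328834_3.out; the logs/ folder must already exist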

6.5.4 Try it out

Try running

sbatch ./scripts/week3/sbatch_test.sh

And look at the resulting .out files.

6.5.5 Processing lists of files using Job Arrays

Now that we know ${SLURM_ARRAY_TASK_ID} identifies each subjob, how do we use it in our script?

Say we have a list of 3 files in our data/ directory, which we can list using the glob ../../data/*.fastq. We can use ${SLURM_ARRAY_TASK_ID} as an index to pick out a different file for each subjob.

The one caveat is that we need to know the number of files beforehand.
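One way around this caveat is to count the files and pass --array on the command line when submitting, since command-line options override the #SBATCH directives in the script. A sketch, assuming the run_sbatch.sh script shown next and that we submit from scripts/week3/:

n_files=$(ls ../../data/*.fastq | wc -l)    # count the FASTQ files
sbatch --array=1-${n_files} run_sbatch.sh   # size the job array to match the number of files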

This script will run our run_bwa.sh on 3 separate files on 3 separate nodes:

#| eval: false
#| filename: scripts/week3/run_sbatch.sh
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --array=1-3
#SBATCH --mem-per-cpu=1gb
#SBATCH --time=00:10:00
file_array=(../../data/*.fastq)
ind=$((SLURM_ARRAY_TASK_ID-1))
current_file=${file_array[$ind]}
./run_bwa.sh $current_file

Walking through the key lines:

1. The --array directive initializes the job array to range from 1 to 3.
2. The glob ../../data/*.fastq lists all files in ../../data/ with the extension .fastq and assigns them to the bash array $file_array.
3. For the current task ID, we calculate the appropriate index (we have to subtract 1 because bash arrays begin at 0).
4. We pull the file path at that index and assign it to $current_file.
5. We run run_bwa.sh on $current_file.

We run this on rhino using this command (we are in ~/bash_for_bio/scripts/week3/):

sbatch run_sbatch.sh

We’ll get the response:

Submitted batch job 35300989

If we take a look at the job queue using squeue, we’ll get something like this:

squeue -u tladera2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        35300989_1 campus-ne sbatch_b tladera2  R       0:10      1 gizmok34
        35300989_2 campus-ne sbatch_b tladera2  R       0:10      1 gizmok40
        35300989_3 campus-ne sbatch_b tladera2  R       0:10      1 gizmok79

We can see our three subjobs, which are indicated by the Job IDs 35300989_1, 35300989_2, and 35300989_3.

tladera2@rhino02:~/bash_for_bio/scripts/week3$ ls -l
total 312
-rw-rw---- 1 tladera2 g_tladera2 211895 Sep 26 10:29 CALU1_combined_final.sam
-rw-rw---- 1 tladera2 g_tladera2 160489 Sep 26 10:29 HCC4006_final.sam
-rw-rw---- 1 tladera2 g_tladera2 121509 Sep 26 10:29 MOLM13_combined_final.sam
-rwxrwxrwx 1 tladera2 g_tladera2    590 Sep 26 09:45 run_bwa.sh
-rwxrwxrwx 1 tladera2 g_tladera2    256 Sep 26 10:28 run_sbatch.sh
-rw-rw---- 1 tladera2 g_tladera2    614 Sep 26 10:29 slurm-35300992_1.out
-rw-rw---- 1 tladera2 g_tladera2    579 Sep 26 10:29 slurm-35300992_2.out
-rw-rw---- 1 tladera2 g_tladera2    619 Sep 26 10:29 slurm-35300992_3.out

And you’ll see we generated our SAM files for each sample! Neat. There are also the .out files from each subjob, which contain the console output for each subjob.

You can see that we output our files to the scripts/week3/ directory. It’s part of your job in the exercises to adapt run_sbatch.sh and run_bwa.sh to output to a directory of your choosing.

Note: Why isn’t my job launching?

The gizmo cluster is used by a lot of people at the Hutch, so it gets busy from time to time.

Don’t worry, your jobs will eventually be processed if your requests are reasonable.

If you want to look at the jobs that a particular user is running, you can use the -u flag. Try squeue -u on one of the users in the queue, for example tladera2:

squeue -u tladera2

You will usually use squeue -u on your own username, so you can see the status of your jobs. This is an example from my previous career at OHSU (I didn’t have time to generate one on gizmo):

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3970834 very_long KIRC.bla    wooma  R 11-04:21:09      1 exanode-7-15
           3970835 very_long OV.blast    wooma  R 11-04:20:32      1 exanode-7-5

Let’s take a look at the output. You can see some pretty useful info: the JOBID, what PARTITION the job is running under, and the STatus.

STatus is really important, because it can tell you whether your job is:

R (Running), PD (Pending), ST (Stopped), S (Suspended), CD (Completed).

R or PD status is what you want to see, because that means it’s in the queue and will be executed.
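If the queue is long and you only care about jobs in a particular state, squeue can also filter by state with the -t flag. For example:

squeue -u tladera2 -t PENDING   # show only this user's pending jobs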

For much more info, please check out https://sciwiki.fredhutch.org/scicomputing/compute_jobs/#job-priority and https://sciwiki.fredhutch.org/scicomputing/compute_jobs/#why-isnt-my-job-running.

6.5.6 scanceling a job array

As we noted, one of the strengths of using a job array to process multiple files is that the subjobs are spawned as children of a single parent job ID.

What if we made a mistake? We can use the scancel command to cancel the entire set of jobs by giving it our parent job id:

scancel 26328834

This will cancel all sub jobs related to the parent job. No fuss, no muss.
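If only a single subjob needs to go, scancel also accepts the subjob ID directly:

scancel 26328834_2   # cancels just the second subjob; the other subjobs keep running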

6.6 What’s Next?

Next week we will discuss using MiniWDL (a workflow manager) to process files through multi-step workflows.