3  Introduction to Scripting

3.1 Exercises

Open up the exercises here.

3.2 What we’re working towards

By the end of this session, you should be able to understand and run this shell script.

#| filename: ./scripts/week2/sam_count.sh
#!/bin/bash                               # <1>
module load SAMtools/1.19.2-GCC-13.2.0    # <2>
samtools view -c $1 > $1.counts.txt       # <3>
module purge                              # <4>

1. Specify the shell language (bash)
2. Load the SAMtools module
3. Run samtools view -c on the input file and save the counts to a file
4. Purge the module

It seems a little intimidating, but we will take this apart line by line.

3.3 Motivation

There is a rule in programming: if you do something more than 3 times, you should consider making it into a script or function.

For example, imagine that you use samtools view -c all the time with certain options and you want to save the output. You can put this command and its options into a shell script that takes a file name as an argument (such as sam_count.sh). Instead of typing samtools view -c over and over again, you can run

./sam_count.sh my_file.bam

and it will output a file my_file.bam.counts.txt that has the counts we are interested in.

3.4 The first line: the she-bang

What’s this first line?

#!/bin/bash

the #! is known as a she-bang - it's a signal to Linux about which language interpreter to use when running the script on the command line. In our case, we want to use bash. But we can also make Python scripts executable by putting the path to the Python interpreter in the she-bang:

#!/usr/bin/env python3

The she-bang is necessary if you want to run the script without using the bash command (after you have made it executable):

./sam_count.sh chr1.sam
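
To make the script executable in the first place, use chmod to add execute permissions. A quick sketch, assuming sam_count.sh is in your current directory:

chmod +x sam_count.sh    # add execute permission
./sam_count.sh chr1.sam  # the she-bang now tells Linux to run it with bash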

Note that you can always use bash to execute a file even if you don't have execute access:

bash sam_count.sh chr1.sam

3.5 Software Modules

Ok, we’ve gotten comfortable navigating around the HPC filesystem. Now how do we run executables on files?

Let’s talk about the two problems:

  1. How do we find executables on a cluster, and
  2. how do we load them up and run them?

3.5.1 Is my software already installed?

Say we want to see if samtools is installed on our HPC. One of the key commands you can use to find software is the which command. If your software is installed, which will give you the path where the software is installed. For example, I can see if bash is installed:

which bash

Which gives me the response:

/bin/bash

So, let’s see if samtools is installed:

which samtools

Which gives no response, so where is samtools?

If we don't have samtools immediately available, how do we find it on our system? On the HPC system, we can use environment modules to load software.

3.5.2 Environment Modules

Before you install your own versions of software, it’s important to realize that this problem may be solved for you.

Your first stop should be looking for environment modules on the HPC. Not all HPCs have these, but if they have them, this should be your first stop to find executables.

Lmod is a system for loading and unloading software modules, and it is usually installed on HPCs. The commands all start with module, and a number of them are useful for you:

  • module avail
  • module load
  • module purge

If you want to see the current list of available modules and their names, check them out here.

Looking for samtools on that page, we discovered the name of our module:

SAMtools

So, that’s what we’ll use to load up samtools.

3.5.3 module avail

We can find the versions of SAMtools available to us by using:

module avail SAMtools

And we’ll get a list of versions:

------------------------------- /app/modules/all -------------------------------
   SAMtools/1.10-GCC-8.3.0        SAMtools/1.16.1-GCC-11.3.0
   SAMtools/1.10-GCCcore-8.3.0    SAMtools/1.17-GCC-12.2.0
   SAMtools/1.11-GCC-10.2.0       SAMtools/1.18-GCC-12.3.0
   SAMtools/1.14-GCC-11.2.0       SAMtools/1.19.2-GCC-13.2.0
   SAMtools/1.15.1-GCC-11.2.0     SAMtools/1.21-GCC-13.3.0   (D)
   SAMtools/1.16.1-GCC-11.2.0

The (D) marks the default version. It is the version we get if we just use:

module load SAMtools

without specifying a version. In general, it is good to pin the module version in your scripts for reproducibility: the default version may change in the future, and those changes may break your code.
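
As a quick illustration of the difference (note that the default version shown here may have changed by the time you read this):

# relies on the default version, which may change out from under you:
module load SAMtools
# pins an exact version - the reproducible choice for scripts:
module load SAMtools/1.19.2-GCC-13.2.0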

3.5.4 module load

Here’s the next line of the script:

module load SAMtools/1.19.2-GCC-13.2.0  #load the module

Our module name is SAMtools, and the 1.19.2-GCC-13.2.0 after it is the version of that module.

Note for FH Users: Modules benefit everyone

If there is a particular bit of software that you need to run on the FH cluster that's not there, make sure to request it from SciComp. Someone else probably needs it too, and making it known means SciComp can add it as an environment module that helps other people.

Note for FH Users

On the FH cluster, ml is a handy shorthand command that combines module load and module list.

You can load a module with ml <module_name>; on its own, ml lists your currently loaded modules.
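
For example (a quick sketch using the SAMtools module from earlier):

ml SAMtools/1.19.2-GCC-13.2.0   # shorthand for: module load SAMtools/1.19.2-GCC-13.2.0
ml                              # lists what is currently loaded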

3.5.5 Tip: Load only as many modules as you need at a time

One of the big issues with bioinformatics software is that the toolchain (the software dependencies needed to run the software) can be different for different executables. This is also called dependency hell. So when possible, load only one or two modules at a time for each step of your analysis. When you’re done with that step, use module purge to clear out the software environment.
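
For example, a two-step analysis might look like this (a sketch using the modules from this chapter; the file names are hypothetical):

# Step 1: align the reads
module load BWA/0.7.17-GCCcore-11.2.0
bwa mem reference.fa my_reads.fastq > my_reads.sam
module purge

# Step 2: count the alignments
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c my_reads.sam > my_reads.counts.txt
module purge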

3.5.6 For more information

Please refer to https://sciwiki.fredhutch.org/scicomputing/compute_environments/ for more information about the available modules.

3.6 $1: A Positional argument

The next line of our script is this:

samtools view -c $1 > $1.counts.txt  

Let's take a look at the command that we're running first. We're going to run samtools view -c, which counts the alignments in an incoming BAM or SAM file, and save that count to a file. We want to be able to run our script like this:

bash sam_count.sh my_file.bam 

When we run it like that, sam_count.sh will run samtools view -c like this:

samtools view -c my_file.bam > my_file.bam.counts.txt

So what's going on here is that there is some substitution going on with the command-line arguments. Let's take a look.

Note: > - redirecting outputs to a file

The > in the script means that we are going to direct the output of samtools view -c into a file.

If we didn't do this, sam_count.sh would output everything to the console.

Much more info about this when we talk about the different outputs to console (Section 3.12).

3.6.1 Positional Arguments such as $1

How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable $1.

Note that this substitution works even inside double-quoted strings.

So, to unpack our script, we are substituting our first argument for the $1.
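
To see positional arguments in isolation, here is a minimal sketch (a hypothetical script, show_args.sh):

#!/bin/bash
# echo back the first two arguments
echo "First argument: $1"
echo "Second argument: $2"

Running bash show_args.sh one two prints First argument: one and Second argument: two.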

When we run this script, it will output into our current directory, which is the top level of the bash_for_bio folder. We can change this behavior by taking the output file path as a second argument:

samtools view -c $1 > $2

with the added complication that we now need to specify a path for our counts file when we call the script.
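
For example (a sketch, assuming a results folder already exists):

./scripts/week2/sam_count.sh my_file.bam ./results/my_file.counts.txt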

Note: Test yourself

How would we rewrite sam_run.sh (shown below) if we wanted to specify the output file as the first argument and the bam file as the second argument?

#!/bin/bash
samtools view -c $1 > $2

For this script, we would switch the positions of $1 and $2.

#!/bin/bash
samtools view -c $2 > $1

3.6.2 For More Info

Please refer to Bite Size Bash page 9 for more info.

3.7 module purge

The last line of our script is:

module purge

This line unloads the loaded modules from your environment. It's good practice to unload modules when you're done with them, especially since they have complex chains of dependencies, and the versions of these dependencies can interfere with other packages.
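
You can check what purge did with module list (a quick sketch):

module list    # shows SAMtools and its chain of dependencies
module purge   # unload everything
module list    # now reports that no modules are loaded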

Note: Processing Files Best Practices

One thing to remember is to not touch the raw data: the original files should remain untouched. That is, reading the raw data files is fine, but avoid overwriting the original data files.

A good way to do this is to have your outputs saved in a different folder.

3.8 shellcheck

When you are writing shell scripts, shellcheck is a really handy way to check that your syntax is correct.

#| filename: scripts/week2/bad_script.sh
#/bin/bash          # <1>
echo "blah blah"

1. A mistake! This should be #!/bin/bash.

If we run shellcheck on this script, it will catch the mistake:

ml shellcheck   #load the shellcheck module
shellcheck scripts/week2/bad_script.sh 

shellcheck will give the following response:

In bad_script.sh line 1:
#/bin/bash
 ^-- SC1113 (error): Use #!, not just #, for the shebang.

shellcheck is available on rhino via ml; you can also install the VSCode extension if you are editing in VSCode.

For more info about shellcheck, refer to Bite Size Bash, page 6.

Note: When you have spaces in file names - escaping characters or double quotes

In general, I try not to use spaces in my file names; they can be unwieldy.

The general solution is to use double quotes or an escape character (\).

For example, if my file name is CALU final.fasta

I can refer to it as:

./scripts/week2/run_bwa.sh "CALU final.fasta"

The escape character (\) is a way to put spaces in a bash command without bash interpreting everything after the space as an argument. Essentially, it tells bash to literally interpret the character that follows it.

./scripts/week2/run_bwa.sh CALU\ final.fasta

I prefer to use double quotes. You’ll notice that all file names in this course use an underscore (_) or dashes (-) instead of spaces, which makes them easier to handle.

3.9 Variables in Bash Scripts

We saw a little bit about using $1, which is a variable in our Bash scripts. Let’s talk about declaring variables in bash scripts and using them using variable expansion.

In Bash, we can declare a variable by using <variable_name>=<value>. Note there are no spaces between the variable name (my_variable), the equals sign, and the value ("ggplot2").

my_variable="ggplot2"

echo "My favorite R package is ${my_variable}"
My favorite R package is ggplot2

Take a look at line 3 above. We expand the variable (that is, we substitute its actual value) by using ${my_variable} in our echo statement.

In general, when expanding a variable in a quoted string, it is better to use ${my_variable} (the variable name in curly brackets). This is especially important when using the variable name as part of a string:

my_var="chr1"
echo "${my_var}_1.vcf.gz"
chr1_1.vcf.gz

If we didn’t use the braces here, like this:

echo "$my_var_1.vcf.gz"

Bash would look for the variable $my_var_1, which doesn’t exist. So use the curly braces {} when you expand variables. It’s safer overall.

Note: Subshells

There is an alternate method for variable expansion that we will use when we call a sub-shell - a shell within a shell. We use parentheses in the $( ) syntax to run a command within the sub-shell and expand its output in the top-level shell. We'll use this when we process multiple files within a single worker.
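
A quick sketch of what this looks like (assuming a ./data folder):

# $(date) runs the date command in a sub-shell and substitutes its output
echo "This analysis ran on $(date)"

# the output of a command can also be captured into a variable
n_files=$(ls ./data | wc -l)
echo "There are ${n_files} files in ./data"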

3.9.1 For more information

Please refer to Bite Size Bash page 7 for more information.

3.10 Putting it all together

We can run the script on a file (assuming we are in bash_for_bio/):

./scripts/week2/sam_count.sh ./data/my_file.bam

3.10.1 Try it Out

Try out the above script on ./data/CALU1_combined_final.sam

Make sure you are in the top folder (bash_for_bio) when you do it.

Note: When should you hard code a path?

In general, hard coding absolute paths (such as /home/laderast/bash_for_bio/scripts/week1/run_this.sh) makes your code less reproducible. Using a project-based workflow (where your paths mostly refer within a single folder) makes it more reproducible.

Of course, there are cases (such as hard coding a genome reference) which should use absolute paths, since the location of these differs across systems. Just make a note when sharing which paths are hard-coded.

I talk about the benefits of project-based workflows here: (Section 12.10).

3.11 Aligning a file

How about running the bwa-mem aligner? Let's feed a FASTQ file into it using ${1}.

#| filename: scripts/week2/run_bwa.sh
#!/bin/bash
module load BWA/0.7.17-GCCcore-11.2.0    # <1>
input_fastq=${1}
# strip path and suffix
base_file_name="${input_fastq%.fastq}"
base_file_name=${base_file_name##*/}
echo "running $input_fastq"
sample_name="SM:${base_file_name}"
read_group_id="ID:${base_file_name}"
platform_info="PL:Illumina"
ref_fasta_local="/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"    # <2>

# <3>
bwa mem \
      -p -v 3 -M \
      -R "@RG\t${read_group_id}\t${sample_name}\t${platform_info}" \
      "${ref_fasta_local}" "${input_fastq}" > \
      "${base_file_name}.sam"
module purge

1. Load the BWA module. Note that we are using a specific module version.
2. Set a variable called $ref_fasta_local, using the location of the hg19 reference.
3. Run bwa mem on the $input_fastq file, using the $ref_fasta_local variable as the location of the reference genome, and output to ${base_file_name}.sam.
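
The two base_file_name lines use Bash parameter expansion to strip the .fastq suffix and then the leading path. A quick sketch with a hypothetical file name:

input_fastq="./data/sample1.fastq"
no_suffix="${input_fastq%.fastq}"
echo "${no_suffix}"         # ./data/sample1 (% removes the matching suffix)
echo "${no_suffix##*/}"     # sample1 (## removes the longest match from the front, here everything up to the last /)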

3.11.1 Try it Out

We can run this script using:

./scripts/week2/run_bwa.sh ./data/CALU1_combined_final.fasta

Where does the output file end up?

3.12 Using pipes: STDIN, STDOUT, STDERR

We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the cluster's shared filesystem and then spawn individual jobs to process each file. For this reason, it is helpful to understand a little bit more about how pipes (|) work in Bash.

If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.

graph LR
  A(STDIN) --> E[run_samtools.sh]
  E --> B(STDOUT)
  E --> C(STDERR)

Figure 3.1: Inputs/outputs to a script

Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR) (Figure 3.1).

STDIN contains information that is directed into the input of a script (usually text output via STDOUT from another script). Note that command-line arguments such as $1 are not passed via STDIN - the shell hands them to the script separately.

Why do these matter? To work in a Unix pipeline, a script must be able to read from STDIN and write to STDOUT and STDERR.

Specifically, in pipelines, the STDOUT of one script (here it's run_bwa.sh) is directed into the STDIN of another command (here wc, or word count):

graph LR
  E[run_bwa.sh] --> B(STDOUT)
  B --> F{"|"}
  E --> C(STDERR)
  F --> D("STDIN (wc)")
  D --> G[wc]

Figure 3.2: Piping a script run_bwa.sh into another command (wc)

We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what’s going wrong.
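
Here is what this looks like in practice (a sketch, assuming my_file.bam exists and the SAMtools module is loaded):

# the STDOUT of samtools view flows through the pipe into wc -l,
# which counts the alignment records
samtools view my_file.bam | wc -l

# STDERR is not piped - it can be redirected separately with 2>
samtools view my_file.bam 2> view_errors.log | wc -l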

3.12.1 > (redirects)

Sometimes you want to direct the output of a script (STDOUT) to a text file. A lot of bioinformatics tools output to STDOUT, and we need a way to save the results, or pass the results onto another program.

Enter >, a redirect. We can put > after our command followed by a file path to save the output.

samtools view -c my_bam_file.bam > counts.txt

Here we are redirecting the output of samtools view into the counts.txt file. Note that every time we run the script, we will overwrite the current contents of the counts.txt file. Sometimes that is not what you want.

There is another kind of redirect, >>, that will append (that is, add to the end of a file). If the file does not exist, it will be created. But if it does exist, the output will be added to the end of the file. I rarely use this, however.
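
For example (hypothetical file names):

samtools view -c file1.bam > counts.txt    # creates (or overwrites) counts.txt
samtools view -c file2.bam >> counts.txt   # appends a second count to the end of counts.txt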

Much more information about redirects can be found here: https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/ and in Bite Size Bash (page 13).

Note: Why this is important on the Cluster

We’ll use pipes and pipelines not only in starting a bunch of jobs using batch scripting on our home computer, but also when we are processing files within a job.

Pipes are at the heart of multi-stage workflows. They allow us to specify multiple steps in processing a file.

3.12.2 For more info about pipes and pipelines

Please refer to the redirect resources above and Bite Size Bash (page 13) for more information about pipes and pipelines.

3.13 Arguments in R Scripts

To make our R Script more useful, it must also accept arguments. Let’s explore how to achieve this.

As we learned in (Section 1.15), we can use Rscript to execute our command non-interactively. Rscript, the command-line executable for R, will run our R script on the command line.

Note that we have a named argument called input_file, and this is handled differently than in Bash - how do we use it in our R script?

3.13.1 Using Named Arguments in an R script

We can pass arguments from our bash script to our R script using commandArgs(), which returns the raw command-line arguments as a character vector. By splitting each name=value pair, we can build a named list of the arguments (such as input_file) that were passed into the R script. We assign the result to the args object.

We refer to our input_file argument as args$input_file in our script.

scripts/week2/process_data.R
library(tidyverse)

# Split each name=value argument (e.g. input_file=myfile.csv)
# and build a named list so we can refer to args$input_file
raw_args <- commandArgs(trailingOnly = TRUE)
split_args <- strsplit(raw_args, "=")
args <- lapply(split_args, function(x) x[2])
names(args) <- sapply(split_args, function(x) x[1])

# Use args$input_file in read.csv
csv_file <- read.csv(file = args$input_file)    # <1>

# Do some work with csv_file
csv_filtered <- csv_file |> janitor::clean_names()

# Write output
write.csv(csv_filtered, file = paste0(args$input_file, "_cleaned.csv"))    # <2>

1. Use args$input_file as the input for read.csv()
2. Use args$input_file to build the output file name.

3.13.2 Try it out: Running our R Script

Now that we’ve set it up, we can run the R script from the command line as follows:

module load fhR
Rscript process_data.R input_file="LUSC_clinical.csv"
module purge

3.13.3 Integrating an R script into a Bash Script

Say you have an R Script you need to run on the command line in combination with other unix commands. In our bash script, we can do the following:

./scripts/week2/wrap_r_script.sh
#!/bin/bash
module load fhR
Rscript process_data.R input_file="${1}"
module purge

Then we can run

bash wrap_r_script.sh ./data/PRAD.csv

In our bash script, wrap_r_script.sh, we're using a positional argument (for simplicity) to specify our CSV file, and then passing that positional argument to the named argument (input_file) for process_data.R.

3.13.4 Summary

  • R scripts will accept named arguments when called by Rscript:
Rscript process_data.R input_file="my_csvfile.csv"

3.14 Arguments in Python Scripts

Similarly, we will need Python scripts to accept arguments as well.

3.14.1 Using sys.argv

The sys module (built into Python) will let us access arguments by position. (For more complex command-line interfaces, Python's built-in argparse module is worth a look.)

process_file.py
import sys                 # <1>
import pandas as pd

file = sys.argv[1]         # <2>
df = pd.read_csv(file)

1. Import the sys module
2. Use sys.argv to access the first argument, sys.argv[1]

If you know a little bit about Python, you'll know that lists in Python are zero-indexed - that is, they start at [0] - but we have started here at sys.argv[1]. So what is at sys.argv[0]?

It turns out that sys.argv[0] is the name of the script. In the above script process_file.py, it contains the string "process_file.py". Having the name of the script can come in handy, especially when outputting things.
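
For example, a hypothetical script name_check.py that uses both:

import sys

# sys.argv[0] holds the script name; sys.argv[1] holds the first argument
print("Script name: " + sys.argv[0])
print("First argument: " + sys.argv[1])

Running python3 name_check.py my_file.csv prints name_check.py on the first line and my_file.csv on the second.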

3.14.2 Running our Python script on the command line

Now that we have set up our Python script, we can run it on the command line on rhino:

module load fhPython
python3 process_file.py lusc_file.csv
module purge

And we could wrap it up in a bash script:

./scripts/week2/wrap_py_script.sh
#!/bin/bash
module load fhPython
python3 process_file.py ${1}
module purge

3.15 Next Time

We’ll work on batch processing, both on a local system and the HPC cluster. Make sure to read the HPC Basic chapter before class.