2  Introduction to Scripting

2.1 What we’re working towards

By the end of this session, you should be able to understand and run this shell script.

#!/bin/bash
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c $1 > $1.counts.txt
module purge

Line by line:

  1. Specify the shell language (bash).
  2. Load the SAMtools module.
  3. Run samtools view -c on the input file and save the counts.
  4. Purge the module.

It seems a little intimidating, but we will take this apart line by line.

2.2 Motivation

There is a rule in programming: if you do something more than 3 times, you should consider making it into a script or function.

For example, imagine that you use samtools view -c all the time with certain options and you want to save the output. You can put this command and its options into a shell script that takes a file name as an argument (such as samcount.sh). Instead of typing the full samtools view -c command over and over again, you can run

./samcount.sh my_file.bam

and it will output a file my_file.bam.counts.txt that has the counts we are interested in.

2.3 The first line: the she-bang

What’s this first line?

#!/bin/bash

The #! is known as a she-bang - it’s a signal that tells Linux which language interpreter to use when running the script on the command line. In our case, we want to use bash. But we can also make Python scripts executable by putting the path to the Python interpreter on the first line, for example:

#!/usr/bin/env python3

The she-bang is necessary if you want to run the script without using the bash command (after you have made it executable):

./samcount.sh chr1.sam

Note that you can always use the bash command to execute a script, even if it does not have execute permission:

bash samcount.sh chr1.sam
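To grant execute permission in the first place, a typical step (a minimal sketch, assuming samcount.sh is in your current directory) is:

chmod +x samcount.sh    # give yourself execute permission on the script
./samcount.sh chr1.sam  # now the she-bang tells Linux to run it with bash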

2.4 Software Modules

Ok, we’ve gotten comfortable navigating around the HPC filesystem. Now how do we run executables on files?

Let’s talk about the two problems:

  1. How do we find executables on a cluster, and
  2. how do we load them up and run them?

2.4.1 Is my software already installed?

Say we want to see if samtools is installed on our HPC. One of the key commands you can use to find software is the which command. If your software is installed, which will give you the path where the software is installed. For example, I can see if bash is installed:

which bash

Which gives me the response:

/bin/bash

So, let’s see if samtools is installed:

which samtools

Which gives no response, so where is samtools?

If we don’t have samtools immediately available, how do we find it on our system? On the HPC system, we can use environment modules to load software.

2.4.2 Environment Modules

Before you install your own versions of software, it’s important to realize that this problem may be solved for you.

Your first stop should be the environment modules on the HPC. Not all HPCs have them, but when they do, they are the easiest place to find executables.

Lmod is a system for loading and unloading software modules, and it is usually installed on HPCs. Its commands all start with module, and a few of them will do most of the work for you:

  • module avail
  • module load
  • module purge

If you want to see the current list of available modules and their names, check them out here.

Looking for samtools on that page, we discovered the name of our module:

SAMtools

So, that’s what we’ll use to load up samtools.
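You can also search for a module directly from the command line. A minimal sketch (module names are usually case-sensitive, so search using the name as listed):

module avail SAMtools    # list every available version of the SAMtools module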

2.4.3 module load

Here’s the next line of the script:

module load SAMtools/1.19.2-GCC-13.2.0  #load the module

Our module name is SAMtools, and the 1.19.2-GCC-13.2.0 after it is the version of that module. In general, it is better to specify a version when running software, because there may be breaking changes in newer versions of the software.

If we do not specify the version number, then module load will load the default (notated by a (D)) version of the module.
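For example (assuming SAMtools is available as a module on your cluster):

module load SAMtools    # loads the default version, marked with (D) in module avail
module list             # confirm which modules and versions are currently loaded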

For FH Users: Modules benefit everyone

If there is a particular bit of software that you need to run on the FH cluster that’s not there, make sure to request it from SciComp. Someone else probably needs it too, so letting SciComp know so they can add it as an environment module will help other people as well.

For FH Users

On the FH cluster, ml is a handy shorthand for the module commands: run ml on its own to list your currently loaded modules, or load a module with ml <module_name>.

2.4.4 Tip: Load only as many modules as you need at a time

One of the big issues with bioinformatics software is that the toolchain (the software dependencies needed to run the software) can be different for different executables. So when possible, load only one or two modules at a time for each step of your analysis. When you’re done with that step, use module purge to clear out the software environment.
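A sketch of this pattern, reusing the module names from this chapter (the file names are hypothetical, and your cluster’s module versions may differ):

# Step 1: align, then clear the software environment
module load BWA/0.7.17-GCCcore-11.2.0
bwa mem genome.fa sample.fastq > sample.sam
module purge

# Step 2: count, starting from a clean environment
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c sample.sam > sample.counts.txt
module purge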

2.4.5 For more information

Please refer to https://sciwiki.fredhutch.org/scicomputing/compute_environments/ for more information about the available modules.

2.5 $1: A Positional argument

The next line of our script is this:

samtools view -c $1 > $1.counts.txt  

Let’s take a look at the command that we’re running first. We’re going to run samtools view -c, which counts the reads in an incoming BAM or SAM file, and we save that count to a file. We want to be able to run our script like this:

bash samtools_count.sh my_file.bam 

When we run it like that, samtools_count.sh will run samtools view -c like this:

samtools view -c my_file.bam > my_file.bam.counts.txt

So what’s going on here is that there is some substitution using command-line arguments. Let’s look at these.

> - redirecting outputs to a file

The > in the script means that we are going to direct the output of samtools view -c into a file.

If we didn’t do this, samtools_count.sh would print everything to the console instead of saving it.

There is much more about this below (Section 2.12), where we talk about the different output streams.

2.5.1 Positional Arguments such as $1

How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable $1.

Note that this substitution works even inside double quotes.

So, to unpack our script, we are substituting our first argument for the $1.

When we run this script, it will output into our current directory, which is the top level of the bash_for_bio folder. We can change the behavior of the script by doing the following:

samtools view -c $1 > $2 

With the added complication that we need to specify a path for our counts file.
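With that change, we would call the script with two arguments (the file names here are hypothetical):

bash samtools_count.sh my_file.bam results/my_file.counts.txt    # $1 = input BAM, $2 = output path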

Test yourself

How would we rewrite sam_run.sh (shown below) if we wanted to specify the output file as the first argument and the bam file as the second argument?

#!/bin/bash
samtools view -c $1 > $2

For this script, we would switch the positions of $1 and $2.

#!/bin/bash
samtools view -c $2 > $1

2.5.2 For More Info

Please refer to Bite Size Bash page 9 for more info.

2.6 module purge

The last line of our script is:

module purge

This line will unload all of the currently loaded modules from your environment. It’s good practice to unload modules when you’re done with them, especially since they have complex chains of dependencies, and the versions of these dependencies can interfere with other packages.
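A quick way to see the effect (assuming the SAMtools module is currently loaded) is to run module list before and after:

module list    # shows SAMtools/1.19.2-GCC-13.2.0 plus its dependencies
module purge
module list    # reports that no modules are loaded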

Processing Files Best Practices

One thing to remember is to not touch the raw data. The original files should remain untouched.

A good way to do this is to have your outputs saved in a different folder.

2.7 Variables in Bash Scripts

We saw a little bit about using $1, which is a variable in our Bash scripts. Let’s talk about declaring variables in Bash scripts and using them with variable expansion.

In Bash, we can declare a variable by using <variable_name>=<value>. Note there are no spaces between the variable (my_variable), equals sign, and the value ("ggplot2").

my_variable="ggplot2"

echo "My favorite R package is ${my_variable}"
My favorite R package is ggplot2

Take a look at the echo line above. We expand the variable (that is, we substitute its value) by using ${my_variable} in our echo statement.

In general, when expanding a variable in a quoted string, it is better to use ${my_variable} (the variable name in curly brackets). This is especially important when using the variable name as part of a string:

my_var="chr1"
echo "${my_var}_1.vcf.gz"
chr1_1.vcf.gz

If we didn’t use the braces here, like this:

echo "$my_var_1.vcf.gz"

Bash would look for the variable $my_var_1, which doesn’t exist. So use the curly braces {} when you expand variables. It’s safer overall.

There is a related method called command substitution, which we will use when we call a sub-shell - a shell within a shell. Instead of curly braces, we wrap a command in $( ); the command runs in the sub-shell and its output is substituted into the top-level shell. We’ll use this when we process multiple files within a single worker.
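A minimal sketch of command substitution (using basename, which strips the directory and an optional suffix from a file name):

my_file="data/my_file.bam"
sample_name=$(basename ${my_file} .bam)    # the sub-shell runs basename; its output is captured
echo "Processing sample ${sample_name}"    # prints: Processing sample my_file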

2.7.1 For more information

Please refer to Bite Size Bash page 7 for more information.

2.8 Putting it all together

We can run the script on a file:

./scripts/samcount.sh data/my_sam_file.sam

2.9 Aligning a file

How about running the bwa-mem aligner? Let’s feed a FASTQ file into it:

#!/bin/bash
module load BWA/0.7.17-GCCcore-11.2.0
REF_GENOME="/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
bwa mem ${REF_GENOME} $1 > $1.aligned.sam
module purge
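If we save this script as, say, align.sh (a hypothetical name), we can run it on a FASTQ file like so:

bash align.sh my_reads.fastq    # produces my_reads.fastq.aligned.sam in the current directory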

2.10 Arguments in R Scripts

To make our R Script more useful, it must also accept arguments. Let’s explore how to achieve this.

As we learned in Section 1.14, we can use Rscript to execute our code non-interactively. Rscript, the command-line executable for R, will run our R script on the command line.

Note that we pass a named argument (called CSVFILE below), which works differently than positional arguments in Bash - how do we use it in our R script?

2.10.1 Using Named Arguments in an R script

We can pass arguments from our Bash script to our R script by using commandArgs() from the R.utils package - called with asValues = TRUE, it parses name=value arguments (such as CSVFILE) that are passed into the R script and returns them as a named list. We assign the output of commandArgs() to the args object.

We then refer to our CSVFILE argument as args$CSVFILE in our script.

scripts/week2/process_data.R
library(R.utils)
library(tidyverse)

# asValues = TRUE parses name=value arguments (such as CSVFILE) into a named list
args <- commandArgs(asValues = TRUE)

# Use args$CSVFILE as the input for read.csv()
csv_file <- read.csv(file = args$CSVFILE)

# Do some work with csv_file
csv_filtered <- csv_file |> janitor::clean_names()

# Use args$CSVFILE to build the output file name
write.csv(csv_filtered, file = paste0(args$CSVFILE, "_filtered.csv"))

2.10.2 Running our R Script

Now that we’ve set it up, we can run the R script from the command line as follows:

Rscript process_data.R CSVFILE="LUSC_clinical.csv"

2.10.3 Integrating an R script into a Bash Script

Say you have an R Script you need to run on the command line in combination with other unix commands. In our bash script, we can do the following:

scripting-basics/wrap_r_script.sh
#!/bin/bash
module load fhR
Rscript process_data.R CSVFILE="${1}"
module purge

Then we can run

bash wrap_r_script.sh my_csvfile.csv 

In our Bash script, wrap_r_script.sh, we’re using a positional argument (for simplicity) to specify our CSV file, and then passing that positional argument on as the named argument (CSVFILE) for process_data.R.

2.10.4 Summary

  • R scripts will accept named arguments when called by Rscript:
Rscript process_csv.R CSVFILE="my_csvfile.csv"

2.11 Arguments in Python Scripts

Similarly, we will need Python scripts to accept arguments as well.

2.11.1 Using sys.argv

The sys module (built into Python) will let us access arguments by position:

process_file.py
import sys                # import the sys module to access command-line arguments
import pandas as pd

file = sys.argv[1]        # sys.argv[1] is the first argument after the script name
df = pd.read_csv(file)

2.11.2 Running our Python script on the command line

Now that we have set up our Python script, we can run it on the command line on rhino:

module load fhPython
python3 process_file.py lusc_file.csv

And we could wrap it up in a bash script:

wrap_py_script.sh
#!/bin/bash
module load fhPython
python3 process_file.py ${1}
module purge
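As with the R wrapper, we can then run it with a single positional argument:

bash wrap_py_script.sh lusc_file.csv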

2.12 Using pipes: STDIN, STDOUT, STDERR

We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the cluster shared filesystem, and then spawns individual jobs to process each file. For this reason, understanding a little bit more about how pipes (|) work in Bash is helpful.

If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.

graph LR
  A(STDIN) --> E[run_samtools.sh]
  E --> B(STDOUT)
  E --> C(STDERR)

Figure 2.1: Inputs/outputs to a script

Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR) (Figure 2.1).

STDIN contains information that is directed to the input of a script (usually text sent via STDOUT from another script). Note that command-line arguments are passed to the script separately, not through STDIN.

Why do these matter? To work in a Unix pipeline, a script must be able to read from STDIN and write its results to STDOUT (and its errors to STDERR).

Specifically, in pipelines, the STDOUT of one script (here run_samtools.sh) is directed into the STDIN of another command (here wc, the word-count utility):

graph LR
  E[run_samtools.sh] --> B(STDOUT)
  B --> F{"|"}
  E --> C(STDERR)
  F --> D("STDIN (wc)")
  D --> G[wc]

Figure 2.2: Piping a script run_samtools.sh into another command (wc)

We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what’s going wrong.
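A minimal sketch of this on the command line (chr1.bam is a hypothetical input, and we assume run_samtools.sh writes its results to STDOUT):

bash run_samtools.sh chr1.bam | wc -l          # STDOUT of the script becomes STDIN of wc
bash run_samtools.sh chr1.bam 2> errors.log    # STDERR can be redirected separately for debugging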

2.12.1 > (redirects)

Sometimes you want to direct the output of a script to a text file. A lot of bioinformatics tools output to standard out, and we need a way to save the results.

Enter >, a redirect. We can put > after our command followed by a file path to save the output.

samtools view -c my_bam_file.bam > counts.txt

Here we are redirecting the output of samtools view into the counts.txt file. Note that every time we run the command, we will overwrite the current contents of the counts.txt file. Sometimes that is not what you want.

There is another kind of redirect, >>, that will append (that is, add to the end of a file). If the file does not exist, it will be created. But if it does exist, the output will be added to the end of the file.
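For example, to keep a running log of counts across several runs (all_counts.txt is a hypothetical file name):

samtools view -c my_bam_file.bam >> all_counts.txt    # appends; earlier results are kept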

Much more information about redirects can be found here: https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/ and in Bite Size Bash.

Why this is important on the Cluster

We’ll use pipes and pipelines not only in starting a bunch of jobs using batch scripting on our home computer, but also when we are processing files within a job.

Pipes are at the heart of multi-stage workflows. They allow us to specify multiple steps in processing a file.

2.12.2 For more info about pipes and pipelines