3 Introduction to Scripting
3.1 Exercises
3.2 What we’re working towards
By the end of this session, you should be able to understand and run this shell script:

#!/bin/bash
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c $1 > $1.counts.txt
module purge

1. Specify the shell language (bash).
2. Load the module.
3. Run the samtools command.
4. Purge the module.

It seems a little intimidating, but we will take this apart line by line.
3.3 Motivation
There is a rule in programming: if you do something more than 3 times, you should consider making it into a script or function.
For example, imagine that you use samtools view -c all the time with certain options and you want to save the output. You can put this command and its options into a shell script that takes a file name as an argument (such as sam_count.sh). Instead of typing samtools view -c over and over again, you can run

./sam_count.sh my_file.bam

and it will output a file my_file.bam.counts.txt that has the counts we are interested in.
3.4 The first line: the she-bang
What’s this first line?
#!/bin/bash

The #! is known as a she-bang - it's a signal that tells Linux which interpreter to use when running the script on the command line. In our case, we want to use bash. But we can also make Python scripts executable by putting the path to Python in the she-bang (the exact path varies by system; a common choice is shown here):

#!/usr/bin/env python3

The she-bang is necessary if you want to run the script without using the bash command (after you have made it executable):

./sam_count.sh chr1.sam

Note that you can always use bash to execute a file if you don't have execute access:

bash sam_count.sh chr1.sam
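To make a script executable in the first place, we use chmod to add execute permission; a quick sketch, assuming sam_count.sh is in our current directory:

chmod +x sam_count.sh    # add execute permission
./sam_count.sh chr1.sam  # now we can run it directly; the she-bang picks the interpreter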
3.5 Software Modules
Ok, we’ve gotten comfortable navigating around the HPC filesystem. Now how do we run executables on files?
Let’s talk about the two problems:
- How do we find executables on a cluster, and
- how do we load them up and run them?
3.5.1 Is my software already installed?
Say we want to see if samtools is installed on our HPC. One of the key commands you can use to find software is the which command. If your software is installed, which will give you the path where the software is installed. For example, I can see if bash is installed:
which bash

Which gives me the response:
/bin/bash
So, let’s see if samtools is installed:
which samtools
Which gives no response, so where is samtools?
If we don't have samtools immediately available, how do we find it on our system? On the HPC system, we can use environment modules to load software.
3.5.2 Environment Modules
Before you install your own versions of software, it’s important to realize that this problem may be solved for you.
Your first stop should be looking for environment modules on the HPC. Not all HPCs have these, but if they have them, this should be your first stop to find executables.
lmod is a system for loading and unloading software modules, and it is usually installed on HPCs. The commands all start with module, and several of them will be especially useful to you:

- module avail
- module load
- module purge
If you want to see the current list of available modules and their names, check them out here.
Looking for samtools on that page, we discovered the name of our module:
SAMtools
So, that’s what we’ll use to load up samtools.
3.5.3 module avail
We can find the versions of SAMtools available to us by using:
module avail SAMtools

And we'll get a list of versions:
------------------------------- /app/modules/all -------------------------------
SAMtools/1.10-GCC-8.3.0 SAMtools/1.16.1-GCC-11.3.0
SAMtools/1.10-GCCcore-8.3.0 SAMtools/1.17-GCC-12.2.0
SAMtools/1.11-GCC-10.2.0 SAMtools/1.18-GCC-12.3.0
SAMtools/1.14-GCC-11.2.0 SAMtools/1.19.2-GCC-13.2.0
SAMtools/1.15.1-GCC-11.2.0 SAMtools/1.21-GCC-13.3.0 (D)
SAMtools/1.16.1-GCC-11.2.0
The (D) marks the default version. It is the version we get if we just use:

module load SAMtools

without specifying a version. In general, it is good to pin the module version in your scripts for reproducibility: the default version may change in the future, and those changes may break your code.
3.5.4 module load
Here’s the next line of the script:
module load SAMtools/1.19.2-GCC-13.2.0 #load the module

Our module name is SAMtools, and the 1.19.2-GCC-13.2.0 after it is the version of that module.

If there is a particular bit of software that you need to run on the FH cluster that's not there, make sure to request it from SciComp. Someone else probably needs it too, so making it known so they can add it as an environment module will help other people.
3.5.5 Tip: Load only as many modules as you need at a time
One of the big issues with bioinformatics software is that the toolchain (the software dependencies needed to run the software) can be different for different executables. This is also called dependency hell. So when possible, load only one or two modules at a time for each step of your analysis. When you’re done with that step, use module purge to clear out the software environment.
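As a sketch of what this looks like in practice, assuming a two-step analysis (the file names here are hypothetical; the module versions are the ones used in this chapter):

# step 1: alignment, with only the BWA module loaded
module load BWA/0.7.17-GCCcore-11.2.0
bwa mem reference.fa my_file.fastq > my_file.sam  # assumes reference.fa was indexed with bwa index
module purge

# step 2: counting, with only the SAMtools module loaded
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c my_file.sam > my_file.counts.txt
module purge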
3.5.6 For more information
Please refer to https://sciwiki.fredhutch.org/scicomputing/compute_environments/ for more information about the available modules.
3.6 $1: A Positional argument
The next line of our script is this:
samtools view -c $1 > $1.counts.txt

Let's take a look at the command that we're running first. We're going to run samtools view -c, which will give us counts on an incoming bam or sam file and save them in a file. We want to be able to run our script like this:

bash sam_count.sh my_file.bam

When we run it like that, sam_count.sh will run samtools view -c like this:

samtools view -c my_file.bam > my_file.bam.counts.txt

So what's going on here is that there is some substitution using the command's arguments. Let's look at these.
3.6.1 Positional Arguments such as $1
How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable $1.
Note that this substitution works even inside double quotes.
So, to unpack our script, we are substituting our first argument for the $1.
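To see positional arguments in isolation, here is a minimal sketch (the script name args_demo.sh is hypothetical):

#!/bin/bash
echo "Script name: $0"          # $0 holds the name of the script itself
echo "First argument: $1"
echo "Second argument: $2"
echo "Number of arguments: $#"  # $# holds the count of arguments

Running bash args_demo.sh one.bam two.bam prints the script name, one.bam, two.bam, and 2.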
When we run this script, it will output into our current directory, which is the top level of the bash_for_bio folder. We can change that behavior by doing the following:

samtools view -c $1 > $2

with the added complication that we now need to specify a path for our counts file as a second argument.
How would we rewrite sam_run.sh (shown below) if we wanted to specify the output file as the first argument and the bam file as the second argument?

#!/bin/bash
samtools view -c $1 > $2

For this script, we would switch the positions of $1 and $2.

#!/bin/bash
samtools view -c $2 > $1

3.6.2 For More Info
Please refer to Bite Size Bash page 9 for more info.
3.7 module purge
The last line of our script is:
module purge
This line unloads the loaded modules from our environment. It's good practice to unload modules when you're done with them, especially since they have complex chains of dependencies, and the versions of these dependencies can interfere with other packages.
3.8 shellcheck
When you are writing shell scripts, shellcheck is a really handy way to check that your syntax is correct. Consider this script (scripts/week2/bad_script.sh):

#/bin/bash
echo "blah blah"

1. A mistake! It should be #!/bin/bash.
If we run shellcheck on this script, it will catch the mistake:
ml shellcheck #load the shellcheck module
shellcheck scripts/week2/bad_script.sh

shellcheck will give the following response:
In bad_script.sh line 1:
#/bin/bash
^-- SC1113 (error): Use #!, not just #, for the shebang.
shellcheck is available on rhino via ml; you can also install the VSCode extension if you are editing in VSCode.
For more info about shellcheck, refer to Bite Sized Bash, page 6.
In general, I try not to use spaces in my file names. They can be unwieldy.
The general solution is to use double quotes or an escape character (\).
For example, if my file name is CALU final.fasta
I can refer to it as:
./scripts/week2/run_bwa.sh "CALU final.fasta"The escape character (\) is a way to put spaces in a bash command without bash interpreting everything after the space as an argument. Essentially, it tells bash to literally interpret the character that follows it.
./scripts/week2/run_bwa.sh CALU\ final.fastaI prefer to use double quotes. You’ll notice that all file names in this course use an underscore (_) or dashes (-) instead of spaces, which makes them easier to handle.
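Quoting matters inside our scripts too: if sam_count.sh is called with a file name containing a space, an unquoted $1 will be split into two arguments. A hedged revision of sam_count.sh that quotes its expansions:

#!/bin/bash
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c "${1}" > "${1}.counts.txt"  # quotes keep a spaced file name together
module purge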
3.9 Variables in Bash Scripts
We saw a little bit about using $1, which is a variable in our Bash scripts. Let’s talk about declaring variables in bash scripts and using them using variable expansion.
In Bash, we can declare a variable by using <variable_name>=<value>. Note there are no spaces between the variable (my_variable), equals sign, and the value ("ggplot2").
my_variable="ggplot2"
echo "My favorite R package is ${my_variable}"

My favorite R package is ggplot2

Take a look at the echo statement above. We expand the variable (that is, we substitute the variable's actual value) by using ${my_variable} in our echo statement.
In general, when expanding a variable in a quoted string, it is better to use ${my_variable} (the variable name in curly brackets). This is especially important when using the variable name as part of a string:
my_var="chr1"
echo "${my_var}_1.vcf.gz"

chr1_1.vcf.gz
If we didn’t use the braces here, like this:
echo "$my_var_1.vcf.gz"
Bash would look for the variable $my_var_1, which doesn’t exist. So use the curly braces {} when you expand variables. It’s safer overall.
There is an alternate method for variable expansion, called command substitution, which we will use when we call a sub-shell - a shell within a shell. We wrap a command in $( ); the command runs in the sub-shell, and its output is expanded into our top-level shell. We'll use this when we process multiple files within a single worker.
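A minimal sketch of command substitution:

# run the date command in a sub-shell and capture its output in a variable
right_now=$(date)
echo "Run started at ${right_now}"

# command substitution also works directly inside a quoted string
echo "This folder contains $(ls | wc -l) items"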
3.9.1 For more information
Please refer to Bite Size Bash page 7 for more information.
3.10 Putting it all together
We can run the script on a file (assuming we are in bash_for_bio/):
./scripts/week2/sam_count.sh ./data/my_file.bam

3.10.1 Try it Out
Try out the above script on ./data/CALU1_combined_final.sam
Make sure you are in the top folder (bash_for_bio) when you do it.
In general, hard coding absolute paths (such as /home/laderast/bash_for_bio/scripts/week1/run_this.sh) makes your code less reproducible. Using a project-based workflow (where your paths mostly refer within a single folder) makes it more reproducible.
Of course, there are cases (such as hard coding a genome reference) which should use absolute paths, since the location of these differs across systems. Just make a note when sharing which paths are hard-coded.
I talk about the benefits of project based workflows here: (Section 12.10).
3.11 Aligning a file
How about running the bwa-mem aligner? Let's pass a FASTQ file to it using ${1}:
#| filename: scripts/week2/run_bwa.sh
#!/bin/bash
module load BWA/0.7.17-GCCcore-11.2.0
input_fastq=${1}
# strip path and suffix
base_file_name="${input_fastq%.fastq}"
base_file_name=${base_file_name##*/}
echo "running $input_fastq"
sample_name="SM:${base_file_name}"
read_group_id="ID:${base_file_name}"
platform_info="PL:Illumina"
ref_fasta_local="/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
bwa mem \
  -p -v 3 -M \
  -R "@RG\t${read_group_id}\t${sample_name}\t${platform_info}" \
  "${ref_fasta_local}" "${input_fastq}" > \
  "${base_file_name}.sam"
module purge

1. Load the BWA module. Note that we are using a specific module version.
2. Set a variable called $ref_fasta_local, using the location of the hg19 reference.
3. Run bwa mem on the $input_fastq file, using the $ref_fasta_local variable as the location of the reference genome, and output to ${base_file_name}.sam.
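The two base_file_name lines use bash parameter expansion to strip the suffix and then the path; a quick sketch of what each step produces (the input path is hypothetical):

input_fastq="./data/sample1.fastq"
base_file_name="${input_fastq%.fastq}"  # % strips the matching suffix: ./data/sample1
base_file_name="${base_file_name##*/}"  # ## strips everything up to the last /: sample1
echo "${base_file_name}"                # prints: sample1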
3.11.1 Try it Out
We can run this script using:
./scripts/week2/run_bwa.sh ./data/CALU1_combined_final.fasta

Where does the output file end up?
3.12 Using pipes: STDIN, STDOUT, STDERR
We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the cluster's shared filesystem and then spawn individual jobs to process each file. For this reason, it helps to understand a little bit more about how pipes (|) work in Bash.
If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.
Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR):

graph LR
A(STDIN) --> E[run_samtools.sh]
E --> B(STDOUT)
E --> C(STDERR)
STDIN contains information that is directed into a script (usually text sent to STDOUT by another script or command). Note that command-line arguments such as $1 are passed to a script separately, not through STDIN.
Why do these matter? To work in a Unix pipeline, a script must be able to read from STDIN and write to STDOUT and STDERR.

Specifically, in pipelines, the STDOUT of one script (here it's run_bwa.sh) is directed into the STDIN of another command (here wc, or word count):
graph LR
E[run_bwa.sh] --> B(STDOUT)
B --> F{"|"}
E --> C(STDERR)
F --> D("STDIN (wc)")
D --> G[wc]

Piping the STDOUT of run_bwa.sh into another command (wc)
We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what’s going wrong.
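For example, using the my_file.bam file from earlier:

# STDOUT of samtools view flows through the pipe into wc -l, which counts lines
samtools view my_file.bam | wc -l

# STDERR is not part of the pipe: any samtools warnings still print to the terminal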
3.12.1 > (redirects)
Sometimes you want to direct the output of a script (STDOUT) to a text file. A lot of bioinformatics tools output to STDOUT, and we need a way to save the results, or pass the results onto another program.
Enter >, a redirect. We can put > after our command followed by a file path to save the output.
samtools view -c my_bam_file.bam > counts.txt

Here we are redirecting the output of samtools view into the counts.txt file. Note that every time we run the script, we will overwrite the current contents of the counts.txt file. Sometimes that is not what you want.
There is another kind of redirect, >>, that will append (that is, add to the end of a file). If the file does not exist, it will be created. But if it does exist, the output will be added to the end of the file. I rarely use this, however.
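For example (log.txt here is a hypothetical file):

echo "first run" >> log.txt   # creates log.txt if it doesn't exist
echo "second run" >> log.txt  # adds a line to the end
cat log.txt                   # shows both lines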
Much more information about redirects can be found here: https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/ and in Bite Sized Bash (page 13).
We’ll use pipes and pipelines not only in starting a bunch of jobs using batch scripting on our home computer, but also when we are processing files within a job.
Pipes are at the heart of multi-stage workflows. They allow us to specify multiple steps in processing a file.
3.12.2 For more info about pipes and pipelines
- Bite Size Bash Page 13 for more about redirects
- Bite Size Bash Page 20 for more about pipes
- https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html
- https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools
3.13 Arguments in R Scripts
To make our R Script more useful, it must also accept arguments. Let’s explore how to achieve this.
As we learned in (Section 1.15), we can use Rscript to execute our command non-interactively. Rscript, which is the command line executable for R, will run our R script on the command line:

Rscript process_data.R input_file="LUSC_clinical.csv"

Note that we have a named argument called input_file, and it is handled differently than in Bash - how do we use this in our R script?
3.13.1 Using Named Arguments in an R script
We can pass arguments from our bash script to our R script by using commandArgs() - this will populate a list of named arguments (such as input_file) that are passed into the R Script. We assign the output of commandArgs() into the args object.
We refer to our input_file argument as args$input_file in our script.
scripts/week2/process_data.R

1. Use args$input_file as an input for read.csv().
2. Use args$input_file to specify the output name.
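The script itself isn't reproduced above, so here is a minimal sketch of what process_data.R could look like, assuming the R.utils package's commandArgs(asValues = TRUE) is what parses the input_file="..." style arguments into a named list:

# process_data.R (sketch)
library(R.utils)

# parse name=value arguments into a named list
args <- commandArgs(trailingOnly = TRUE, asValues = TRUE)

data <- read.csv(args$input_file)                         # 1. args$input_file as input
output_file <- paste0(args$input_file, ".processed.csv")  # 2. args$input_file in the output name
write.csv(data, file = output_file)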
3.13.2 Try it out: Running our R Script
Now that we’ve set it up, we can run the R script from the command line as follows:
module load fhR
Rscript process_data.R input_file="LUSC_clinical.csv"
module purge

3.13.3 Integrating an R script into a Bash Script
Say you have an R Script you need to run on the command line in combination with other unix commands. In our bash script, we can do the following:
./scripts/week2/wrap_r_script.sh
#!/bin/bash
module load fhR
Rscript process_data.R input_file="${1}"
module purge

Then we can run:

bash wrap_r_script.sh ./data/PRAD.csv

In our bash script, wrap_r_script.sh, we're using a positional argument (for simplicity) to specify our CSV file, and then passing that positional argument to the named argument (input_file) for process_data.R.
3.13.4 Summary
- R scripts will accept named arguments when called by Rscript:

Rscript process_data.R input_file="my_csvfile.csv"

3.14 Arguments in Python Scripts
Similarly, we will need Python scripts to accept arguments as well.
3.14.1 Using sys.argv
The sys module (built into Python) will let us access arguments by position:
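The original process_file.py isn't reproduced here; a minimal sketch based on the annotations below:

# process_file.py (sketch)
import sys                         # 1. import the sys module

input_file = sys.argv[1]           # 2. access the first argument
print(f"processing {input_file}")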
1. Import the sys module.
2. Use sys.argv to access the first argument, sys.argv[1].
If you know a little bit about Python, you’ll know that arrays in python are zero-indexed, that is, they start at [0], but we have started here at sys.argv[1] - what is at sys.argv[0]?
It turns out that sys.argv[0] is the name of the script. In the above script process_file.py, it contains the string "process_file.py". Having the name of the script can come in handy, especially when outputting things.
3.14.2 Running our Python script on the command line
Now that we have set up our Python script, we can run it on the command line on rhino:
module load fhPython
python3 process_file.py lusc_file.csv
module purge

And we could wrap it up in a bash script:
./scripts/week2/wrap_py_script.sh
#!/bin/bash
module load fhPython
python3 process_file.py ${1}
module purge

3.15 Next Time
We’ll work on batch processing, both on a local system and the HPC cluster. Make sure to read the HPC Basic chapter before class.