By the end of this session, you should be able to understand and run this shell script.
#!/bin/bash
module load SAMtools/1.19.2-GCC-13.2.0
samtools view -c $1 > $1.counts.txt
module purge
It seems a little intimidating, but we will take this apart line by line.
There is a rule in programming: if you do something more than 3 times, you should consider making it into a script or function.
For example, imagine that you use samtools view -c all the time with certain options and you want to save the output. You can put this command and its options into a shell script that takes a file as an argument (such as samcount.sh). Instead of typing samtools view -c over and over again, you can run
./samcount.sh my_file.bam
and it will output a file my_file.bam.counts.txt that has the counts we are interested in.
What’s this first line?
#!/bin/bash
The #! is known as a she-bang - it’s a signal that tells Linux which language interpreter to use when running the script on the command line. In our case, we want to use bash. We can also make Python scripts executable by putting the path to the Python interpreter in the she-bang, for example:
#!/usr/bin/env python3
The she-bang is necessary if you want to run the script without using the bash command (after you have made it executable):
./samcount.sh chr1.sam
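For reference, if you haven't yet made the script executable, you can add execute permission with chmod (using the samcount.sh example from above):
chmod +x samcount.sh     # add execute permission to the script
./samcount.sh chr1.sam   # now the she-bang tells Linux to run it with bash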
Note that you can always use the bash command to execute a file if you don’t have execute permission on it:
bash samcount.sh chr1.sam
Ok, we’ve gotten comfortable navigating around the HPC filesystem. Now how do we run executables on our files? Let’s work through two problems: finding the software we need on the cluster, and then running it on our files.
Say we want to see if samtools is installed on our HPC. One of the key commands you can use to find software is the which command. If your software is installed, which will give you the path where the software is installed. For example, I can see if bash is installed:
which bash
Which gives me the response:
/bin/bash
So, let’s see if samtools is installed:
which samtools
Which gives no response, so where is samtools?
If we don’t have samtools immediately available, how do we find it on our system? On the HPC, we can use environment modules to load software.
Before you install your own versions of software, it’s important to realize that this problem may be solved for you.
Your first stop should be looking for environment modules on the HPC. Not all HPCs have these, but if they have them, this should be your first stop to find executables.
Lmod is a system for loading and unloading software modules, and it is usually installed on HPCs. Its commands all start with module, and there are a number of subcommands that are useful for you:
module avail
module load
module purge
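As a quick sketch of how these fit together (using the SAMtools module from our script; the exact module names and versions on your cluster may differ):
module avail SAMtools                    # search the available modules for SAMtools
module load SAMtools/1.19.2-GCC-13.2.0   # load a specific version
module list                              # show which modules are currently loaded
module purge                             # unload everything when you are done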
If you want to see the current list of available modules and their names, check the list of available modules for your cluster (for the FH cluster, see the link at the end of this section).
Looking for samtools on that page, we discovered the name of our module: SAMtools. So, that’s what we’ll use to load up samtools.
module load
Here’s the next line of the script:
module load SAMtools/1.19.2-GCC-13.2.0 #load the module
Our module name is SAMtools, and the 1.19.2-GCC-13.2.0 after it is the version of that module. In general, it is better to specify a version when loading software, because there may be breaking changes in newer versions of the software.
If we do not specify the version number, then module load will load the default version of the module (denoted by a (D) in the module listing).
If there is a particular bit of software that you need to run on the FH cluster that’s not there, make sure to request it from SciComp. Someone else probably needs it too, so making it known so that they can add it as an environment module will help other people as well.
One of the big issues with bioinformatics software is that the toolchain (the software dependencies needed to run the software) can be different for different executables. So when possible, load only one or two modules at a time for each step of your analysis. When you’re done with that step, use module purge
to clear out the software environment.
Please refer to https://sciwiki.fredhutch.org/scicomputing/compute_environments/ for more information about the available modules.
$1: A Positional argument
The next line of our script is this:
samtools view -c $1 > $1.counts.txt
Let’s take a look at the command that we’re running first. We’re going to run samtools view -c, which will give us counts on an incoming bam or sam file and save them in a file. We want to be able to run our script like this:
bash samcount.sh my_file.bam
When we run it like that, samcount.sh will run samtools view -c like this:
samtools view -c my_file.bam > my_file.bam.counts.txt
So what’s going on here is that there is some substitution using the command’s arguments. Let’s look at these.
$1
How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (the terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable $1.
Note that this substitution works even inside double quotes.
So, to unpack our script, we are substituting our first argument for the $1.
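As a small illustration, here is a toy script (print_args.sh, written just for this example and not part of the course materials) that echoes its first two positional arguments:
#!/bin/bash
# print_args.sh - show how positional arguments are substituted
echo "The first argument is: $1"
echo "The second argument is: $2"
Running bash print_args.sh file1.bam file2.bam prints the two file names back, showing that $1 and $2 hold whatever follows the script name, in order.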
When we run this script, it will write its output into our current directory, which is the top level of the bash_for_bio folder. We can change the behavior of the script by specifying the output file as a second argument:
samtools view -c $1 > $2
with the added complication that we now need to specify a path for our counts file.
How would we rewrite sam_run.sh (shown below) if we wanted to specify the output file as the first argument and the bam file as the second argument?
#!/bin/bash
samtools view -c $1 > $2
For this script, we would switch the positions of $1 and $2:
#!/bin/bash
samtools view -c $2 > $1
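Assuming we saved this rewritten version as sam_run.sh, we would now call it with the output file first and the bam file second, for example:
bash sam_run.sh my_file.bam.counts.txt my_file.bam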
Please refer to Bite Size Bash page 9 for more info.
module purge
The last line of our script is:
module purge
This line will unload the modules from memory. It’s good practice to unload modules when you’re done with them, especially since they have complex chains of dependencies, and the versions of these dependencies can interfere with other packages.
We saw a little bit about using $1, which is a variable in our Bash scripts. Let’s talk about declaring variables in Bash scripts and using them via variable expansion.
In Bash, we can declare a variable by using <variable_name>=<value>. Note there are no spaces between the variable name (my_variable), the equals sign, and the value ("ggplot2").
my_variable="ggplot2"
echo "My favorite R package is ${my_variable}"
My favorite R package is ggplot2
Take a look at the echo statement above (the second line). We expand the variable (that is, we substitute its actual value) by using ${my_variable} in our echo statement.
In general, when expanding a variable in a quoted string, it is better to use ${my_variable}
(the variable name in curly brackets). This is especially important when using the variable name as part of a string:
my_var="chr1"
echo "${my_var}_1.vcf.gz"
chr1_1.vcf.gz
If we didn’t use the braces here, like this:
echo "$my_var_1.vcf.gz"
Bash would look for the variable $my_var_1
, which doesn’t exist. So use the curly braces {}
when you expand variables. It’s safer overall.
There is an alternate method for expansion, $( ), which we will use when we call a sub-shell - a shell within a shell. The parentheses run a command within the sub-shell and substitute its output into the top-level shell. We’ll use this when we process multiple files within a single worker.
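As a minimal sketch of that pattern (the file names here are hypothetical), $( ) runs ls in a sub-shell and substitutes its output into the loop in the top-level shell:
# loop over every SAM file returned by ls, run in a sub-shell via $( )
for sam_file in $(ls data/*.sam); do
    echo "Processing ${sam_file}"
done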
Please refer to Bite Size Bash page 7 for more information.
We can run the script on a file:
./scripts/samcount.sh data/my_sam_file.sam
How about running the bwa-mem aligner? Let’s feed a FASTQ file into it:
#!/bin/bash
module load BWA/0.7.17-GCCcore-11.2.0
REF_GENOME="/shared/biodata/reference/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
bwa mem ${REF_GENOME} $1 > $1.aligned.sam
module purge
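If we saved this script as, say, run_bwa.sh (a name chosen here just for illustration), we could align a FASTQ file like this:
bash run_bwa.sh my_sample.fastq
# produces my_sample.fastq.aligned.sam in the current directory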
To make our R script more useful, it must also accept arguments. Let’s explore how to achieve this.
As we learned in Section 1.14, we can use Rscript to execute our command non-interactively. Rscript, which is the command-line executable, will run our R script on the command line.
Note that we use a named argument (called CSVFILE), and this is handled differently than in Bash - how do we use it in our R script?
We can pass arguments from our Bash script to our R script by using commandArgs() - this will populate a list of the arguments (such as CSVFILE) that are passed into the R script. We assign the output of commandArgs() into the args object, and we refer to our CSVFILE argument as args$CSVFILE in our script.
In scripts/week2/process_data.R, we:
1. Use args$CSVFILE as an input for read.csv()
2. Use args$CSVFILE to specify the output name.
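The body of process_data.R isn't shown here, but a minimal sketch of the idea could look like the following. Note that base R's commandArgs() returns a character vector, so the step that parses CSVFILE="..." into a named args list is an assumption about how the script handles it, and the output step is purely illustrative:
# process_data.R (sketch - not the actual course script)
# parse NAME=value arguments (e.g. CSVFILE=LUSC_clinical.csv) into a named list
raw_args <- commandArgs(trailingOnly = TRUE)
pieces <- strsplit(raw_args, "=")
args <- lapply(pieces, function(x) gsub('"', "", x[2]))
names(args) <- sapply(pieces, function(x) x[1])

# use args$CSVFILE as an input for read.csv()
clinical <- read.csv(args$CSVFILE)

# use args$CSVFILE to specify the output name
out_file <- paste0(args$CSVFILE, ".nrows.txt")
writeLines(paste("Number of rows:", nrow(clinical)), out_file)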
Now that we’ve set it up, we can run the R script from the command line as follows:
Rscript process_data.R CSVFILE="LUSC_clinical.csv"
Say you have an R Script you need to run on the command line in combination with other unix commands. In our bash script, we can do the following:
scripting-basics/wrap_r_script.sh
#!/bin/bash
module load fhR
Rscript process_data.R CSVFILE="${1}"
module purge
Then we can run
bash wrap_r_script.sh my_csvfile.csv
In our bash script, wrap_r_script.sh, we’re using a positional argument (for simplicity) to specify our CSV file, and then passing that positional argument on to the named argument (CSVFILE) for process_data.R.
This is equivalent to calling Rscript directly:
Rscript process_data.R CSVFILE="my_csvfile.csv"
Similarly, we will need our Python scripts to accept arguments as well.
sys.argv
The sys module (built into Python) lets us access command-line arguments by position:
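The script itself isn't reproduced here, but a minimal sketch of what process_file.py might look like, using sys.argv to grab the file name by position (the line-counting body is just an illustrative stand-in):
# process_file.py (sketch - not the actual course script)
import sys

# sys.argv[0] is the script name; sys.argv[1] is the first argument after it
csv_file = sys.argv[1]

# stand-in processing: count the lines in the file
with open(csv_file) as handle:
    n_lines = sum(1 for _ in handle)

print(f"{csv_file} has {n_lines} lines")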
Now that we have set up our Python script, we can run it on the command line on rhino:
module load fhPython
python3 process_file.py lusc_file.csv
And we could wrap it up in a bash script:
wrap_py_script.sh
#!/bin/bash
module load fhPython
python3 process_file.py ${1}
module purge
We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the cluster’s shared filesystem, and then spawns individual jobs to process each file. For this reason, understanding a little bit more about how pipes (|) work in Bash is helpful.
If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.
[Figure 9.1: STDIN flows into run_samtools.sh, which produces STDOUT and STDERR.]
Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR) (Figure 9.1).
STDIN contains information that is directed to the input of a script (usually text output via STDOUT from another script). This includes arguments we pass onto the script.
Why do these matter? To work in a Unix pipeline, a script must be able to utilize STDIN, and generate STDOUT, and STDERR.
Specifically, in pipelines, the STDOUT of a script (here it’s run_samtools.sh) is directed into the STDIN of another command (here wc, or word count).
[Figure 9.2: The STDOUT of run_samtools.sh is piped (|) into the STDIN of another command (wc).]
We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what’s going wrong.
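For example, assuming run_samtools.sh from the figure writes its results to STDOUT, we could pipe them into wc -l to count lines of output while keeping STDERR in a log file for debugging (the file names here are hypothetical):
# STDOUT goes through the pipe into wc; STDERR is redirected to a log file
bash run_samtools.sh my_file.bam 2> run_samtools.log | wc -l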
> (redirects)
Sometimes you want to direct the output of a script to a text file. A lot of bioinformatics tools output to standard out, and we need a way to save the results.
Enter >, a redirect. We can put > after our command, followed by a file path, to save the output.
samtools view -c my_bam_file.bam > counts.txt
Here we are redirecting the output of samtools view -c into the counts.txt file. Note that every time we run the script, we will overwrite the current contents of the counts.txt file. Sometimes that is not what you want.
There is another kind of redirect, >>, that will append (that is, add to the end of a file). If the file does not exist, it will be created. But if it does exist, the output will be added to the end of the file.
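For example, we could collect counts from several bam files into the same counts.txt file (the bam file names here are hypothetical):
samtools view -c chr1.bam >> counts.txt   # creates counts.txt if it doesn't exist
samtools view -c chr2.bam >> counts.txt   # appends the second count to the same file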
Much more information about redirects can be found here: https://www.geeksforgeeks.org/linux-unix/input-output-redirection-in-linux/ and in Bite Sized Bash.
We’ll use pipes and pipelines not only when starting a bunch of jobs using batch scripting from our home computer, but also when we are processing files within a job.
Pipes are at the heart of multi-stage workflows. They allow us to specify multiple steps in processing a file.
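As one illustrative sketch (not part of the course scripts), a pipe lets us align reads with bwa mem, convert the output to BAM, and sort it, without writing intermediate files to disk. This assumes the BWA and SAMtools modules are loaded and REF_GENOME is set as in the bwa script above:
# align, convert SAM to BAM, and coordinate-sort in a single pipeline
bwa mem "${REF_GENOME}" my_sample.fastq | samtools view -b - | samtools sort -o my_sample.sorted.bam -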