2  Week 1: Navigating

2.1 Learning Objectives

By the end of this session, you should be able to:

  • Navigate and copy data to the different filesystems available at Fred Hutch.
  • Explain the difference between absolute and relative file paths.
  • Set permissions on and execute a bash script
  • Execute scripts written in Python and R on the command line
  • Find help on the system and on the web

2.2 Exercises

Open up the exercises here or in Google Classroom.

Note: Reminder about Terminology

Defined words are double underlined. You can click and hold on them to see the definition. Try it below!

2.4 Setting Yourself Up for Success

Make sure you:

I will demo how to connect to rhino using the Scicomp On Demand dashboard. This site has a handy “Rhino Shell Access” menu item under “Clusters”.

When you are scripting, I suggest you open two terminal windows: the first one is for editing scripts, and the second one is for running scripts on the command line.

So now we have logged into rhino. Now what?

2.5 Grabbing Stuff from GitHub

For the rest of today’s exercises, we’ll grab the scripts from GitHub using git clone.

git clone https://github.com/fhdsl/bash_for_bio

This will create a folder called bash_for_bio/ in our current directory. This directory has all of the course materials, including the scripts.

Important: Stay in bash_for_bio/

Throughout this course, I expect you to run code in the base bash_for_bio/ folder, not in scripts or in data. All of the code is tested with this in mind.

If you are having problems executing the code, please make sure you are in the base bash_for_bio folder, or adjust your file paths when running the script.
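A quick way to confirm where you are before running code is sketched below. It assumes you cloned the repository under its default name, bash_for_bio; the variable name current_folder is only for illustration.

```shell
# Print the full path of the current directory
pwd

# Keep only the last part of that path (the folder name)
current_folder=$(basename "$PWD")

if [ "$current_folder" = "bash_for_bio" ]; then
  echo "You are in the base bash_for_bio/ folder."
else
  echo "You are in $current_folder; use cd to move to bash_for_bio/ first."
fi
```

If the second message appears, cd into bash_for_bio/ (or adjust your file paths) before running the course scripts.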

2.5.1 du: How much space?

One of the first things we can do is check disk usage with the du command. If I run du by itself on the command line, it reports the disk usage of every folder, at every level, below our current directory, which is a lot of output.

Two options make the output easier to read. -d lets us specify the depth: -d 1 reports only the sizes of the top-level folders in our directory. -h prints the sizes in human-readable units such as K, M, and G.

Make sure you are in the bash_for_bio/ directory. Then try the following command:

du -d 1 -h .

Here are the first few lines of my du output within the bash_for_bio folder:

240K    ./_extensions
192K    ./.quarto
616K    ./scripts
1.9M    ./data
8.6M    ./.git
6.7M    ./docs
10M     ./images
30M     .
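As a small sketch of the options above, the example below builds a throwaway folder and runs du on it, so it is safe to try anywhere. The folder and file names (demo_dir, big_file.bin) are made up for illustration, and sort -h assumes GNU sort, which Linux systems typically provide.

```shell
# Build a small throwaway folder to measure
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/scripts" "$demo_dir/data"
printf 'echo hello\n' > "$demo_dir/scripts/hello.sh"
head -c 200000 /dev/zero > "$demo_dir/data/big_file.bin"

# Depth 1, human-readable sizes, sorted from smallest to largest
du -d 1 -h "$demo_dir" | sort -h

# du always prints the total for the folder itself on its last line
du_total=$(du -d 1 -h "$demo_dir" | tail -n 1)
echo "Total: $du_total"
```

Sorting is handy when a directory has many subfolders and you only care about the largest ones.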

If we want to specify du to scan only a single folder, we can give the folder name.

du -d 1 -h scripts

And I will get the following output:

144K    scripts/week1
56K     scripts/__pycache__
128K    scripts/week3
232K    scripts/week2
616K    scripts

Try it out

Try checking the disk usage using du for the bash_for_bio/ folder in your /home directory (mine is /home/tladera2/bash_for_bio/).

du -d 1 -h ~/bash_for_bio/

Try out using du -d 2 on your home directory:

du -d 2 ~/

2.6 FH users: the main filesystems

When working on the Fred Hutch HPC, there are four main filesystems you should consider:

  • /home/ - The home filesystem. Your scripts can live here. Your configuration files (such as .bashrc) also live here. Can be accessed using ~/.
  • /fh/fast/ (also known as fast) - Research storage. Raw files and processed results should live here.
  • /hpc/temp/ (also known as temp) - The temporary filesystem. This filesystem is faster to access for gizmo nodes on the cluster, so files can be copied to it for computation. The output files you generate should be moved back into an appropriate folder on /fh/fast/. Note that files on /hpc/temp/ will be deleted after 30 days.
  • /fh/regulated/ - A secure filesystem meant for NIH-regulated data. If you are processing data that is regulated under the current NIH guidelines, you will process it here.
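The copy, compute, and move-back pattern described for fast and temp can be sketched as follows. This is only an illustration: it uses folders inside a temporary sandbox as stand-ins for /fh/fast/ and /hpc/temp/, and the file names (sample_1.fq, line_count.txt) are hypothetical.

```shell
# Stand-ins for the real filesystems, so this sketch can run anywhere.
# On rhino, these would be folders under /fh/fast/ and /hpc/temp/.
sandbox=$(mktemp -d)
fast_dir="$sandbox/fast/raw_data"
temp_dir="$sandbox/temp/raw_data"
mkdir -p "$fast_dir" "$temp_dir"
printf '@read1\nACGT\n+\nFFFF\n' > "$fast_dir/sample_1.fq"

# 1. Copy the input from research storage to the temporary filesystem
cp "$fast_dir/sample_1.fq" "$temp_dir/"

# 2. Do the computation on temp (here, just counting lines)
wc -l < "$temp_dir/sample_1.fq" > "$temp_dir/line_count.txt"

# 3. Move the result back to research storage before temp is cleaned up
mv "$temp_dir/line_count.txt" "$fast_dir/"
ls "$fast_dir"
```

The important habit is step 3: anything left only on temp will eventually be deleted.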

So, how do we utilize these filesystems? We will be running commands like this:

ml BWA
bwa mem -M -t 2 \
  /fh/fast/reference_data/chr20 \
  /fh/fast/laderas_t/raw_data/na12878_1.fq \
  /fh/fast/laderas_t/raw_data/na12878_2.fq > \
  /hpc/temp/laderas_t/aligned_data/na12878_1.sam

  1. ml BWA: load the bwa software.
  2. bwa mem -M -t 2: start bwa mem (the aligner).
  3. /fh/fast/reference_data/chr20: path of the genome index.
  4. The two .fq files: paths of the paired-end read files.
  5. The path after >: path of the output.

To understand the above, we first have to familiarize ourselves with absolute vs. relative paths.
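As a minimal sketch of the difference, the example below reads the same file through an absolute path and through relative paths. The folder layout (project/data/reads.txt) is invented for the demonstration and lives in a temporary directory.

```shell
# Make a small folder tree to experiment in
path_demo=$(mktemp -d)
mkdir -p "$path_demo/project/data"
echo "ACGT" > "$path_demo/project/data/reads.txt"

# Absolute path: starts at the root (/), so it works from any directory
cat "$path_demo/project/data/reads.txt"

# Relative path: starts from the current directory
cd "$path_demo/project"
cat data/reads.txt      # same as ./data/reads.txt

# .. means "one folder up": from data/, go up to project/ and back down
cd data
relative_result=$(cat ../data/reads.txt)
echo "$relative_result"
```

All three cat commands print the same contents. The paths in the bwa command above all start with /, so they are absolute and work no matter which directory you run the command from.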

When you need to span multiple lines: \

Sometimes it’s hard to read code that is a single line. You can break up a very long line of code using the \ (backslash) character.

For example, instead of:

bwa mem -M -t 2 /fh/fast/reference_data/chr20 /fh/fast/laderas_t/raw_data/na12878_1.fq /fh/fast/laderas_t/raw_data/na12878_2.fq > /hpc/temp/laderas_t/aligned_data/na12878_1.sam  

We can rewrite it as:

ml BWA
bwa mem -M -t 2 \
  /fh/fast/reference_data/chr20 \
  /fh/fast/laderas_t/raw_data/na12878_1.fq \
  /fh/fast/laderas_t/raw_data/na12878_2.fq > \
  /hpc/temp/laderas_t/aligned_data/na12878_1.sam

  1. /fh/fast/reference_data/chr20: path of the reference genome.
  2. na12878_1.fq: first paired-end read FASTQ file.
  3. na12878_2.fq: second paired-end read FASTQ file.
  4. na12878_1.sam: output file location.

We’ll use this throughout the book so the code is easier to read.
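You can convince yourself that a split command behaves exactly like the one-line version with a harmless example. This sketch uses echo and made-up variable names (one_line, split_lines) rather than a real aligner.

```shell
# The command written on one line
one_line=$(echo alpha beta gamma)

# The same command split across lines. Each \ must be the very last
# character on its line: a trailing space after it breaks the command.
split_lines=$(echo alpha \
  beta \
  gamma)

echo "$one_line"
echo "$split_lines"
```

Both echo commands print the same text, because bash joins the continued lines before running the command.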

2.6.1 More about the FH Filesystems

For more details, see the Fred Hutch SciWiki: https://sciwiki.fredhutch.org/scicomputing/store_posix/

Even if you don’t have execute permissions

You can still run a bash script that you only have read permission for by passing it to the bash command:

bash run_samtools.sh my_bam_file.bam

The same applies to scripts that use shebangs (Section 3.4) for R, Python, or any other interpreter: you can run them by passing the script to that interpreter.
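Here is a small sketch of that idea, using a throwaway script. The script name say_hello.sh is made up for the example; it is not part of the course repository.

```shell
# Create a tiny script in a temporary folder
perm_demo=$(mktemp -d)
printf '#!/bin/bash\necho "hello from the script"\n' > "$perm_demo/say_hello.sh"

# Give the owner read and write permission only: no execute bit anywhere
chmod u=rw,go= "$perm_demo/say_hello.sh"
ls -l "$perm_demo/say_hello.sh"

# Passing the file to bash only requires read permission
bash_output=$(bash "$perm_demo/say_hello.sh")
echo "$bash_output"
```

The ls -l line should start with -rw-------, showing that nobody has execute permission, yet bash still runs the script.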

2.6.2 Try it out

What are the permissions for the GitHub repo (bash_for_bio) that you just downloaded?

2.7 Running a Bash Script

OK, now that we have a bash script, tell_the_time.sh, in our current directory, how do we run it?

Because the script is not on our $PATH (Section 15.3.2), we’ll need to use ./ to execute it. ./ refers to the current folder, and it tells bash that the command we want to execute is in our current folder.

./tell_the_time.sh

If we haven’t set the permissions (Section 1.4) correctly, we’ll get this message:

bash: ./tell_the_time.sh: Permission denied

But if we have execute access, we’ll get something like this:

Fri Jul 11 13:27:47 PDT 2025

This is the current date and time.
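The whole cycle of failing, fixing permissions, and running can be sketched with a stand-in script. I don’t reproduce tell_the_time.sh here; show_date.sh is a hypothetical script that simply calls date.

```shell
# Work in a temporary folder with a stand-in script
run_demo=$(mktemp -d)
cd "$run_demo"
printf '#!/bin/bash\ndate\n' > show_date.sh

# Without the execute bit, ./show_date.sh fails with "Permission denied"
chmod a-x show_date.sh
./show_date.sh 2>/dev/null || echo "Could not run: no execute permission"

# Add execute permission for the owner, then run it from the current folder
chmod u+x show_date.sh
date_output=$(./show_date.sh)
echo "$date_output"
```

chmod u+x only changes the permission for you, the owner of the file, which is usually all you need for your own scripts.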

2.8 Running an R or Python Script on the command line

2.8.1 Loading the fhR or fhPython modules

Before we can run our scripts in R or Python, we’ll need to load up either R or Python on the cluster. We can do this with the module load command:

module load fhR
module load fhPython

  1. module load fhR: load the fhR module, which has R and most packages installed.
  2. module load fhPython: load the fhPython module, which has Python and most packages installed.

We’ll talk more about software modules next week (Section 3.5).
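If you are not sure whether a module has been loaded, one simple check is to ask the shell where it would find the interpreter. The helper below, check_tool, is a made-up function for this sketch; it relies only on the standard command -v, not on the module system itself.

```shell
# Report where a program lives on $PATH, or say that it is missing
check_tool() {
  if tool_path=$(command -v "$1"); then
    echo "$1 found at $tool_path"
  else
    echo "$1 not found; try loading its module first"
  fi
}

check_tool Rscript   # provided on the cluster by: module load fhR
check_tool python3   # provided on the cluster by: module load fhPython
```

Running the check before and after module load shows how loading a module changes what is on your $PATH.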

2.8.2 R Users

You might not be aware that there are multiple ways to run R:

  1. as an interactive console, which is what we usually use in an IDE such as RStudio
  2. on the command line using the Rscript command:

Rscript my_r_script.R

To run this script, we’ll need to first load fhR:

module load fhR
Rscript my_r_script.R
module purge

2.8.3 Python Users

Python users are usually more familiar with running scripts on the command line:

python3 my_python_script.py

To execute this on gizmo, we’ll first need to load fhPython:

module load fhPython
python3 my_python_script.py
module purge

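Within a Python script, you can also add a shebang (Section 3.4) as the first line so that the script itself becomes executable. The shebang gives the location of python3. A common choice is to go through env, which finds whichever python3 comes first on your $PATH, such as the one provided by fhPython:

#!/usr/bin/env python3

Once the script has execute permission, you can run it directly:

./my_python_script.py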
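To see a shebang in action, here is a minimal sketch that writes a two-line Python script and runs it directly. The file name hello.py and the use of #!/usr/bin/env python3 are my assumptions for the example; on the cluster, python3 would come from the fhPython module.

```shell
shebang_demo=$(mktemp -d)

# Write a two-line Python script; the shebang must be the very first line
printf '#!/usr/bin/env python3\nprint("hello from python")\n' > "$shebang_demo/hello.py"
chmod u+x "$shebang_demo/hello.py"

if command -v python3 >/dev/null 2>&1; then
  # The shebang tells the system to run this file with python3
  shebang_output=$("$shebang_demo/hello.py")
else
  shebang_output="python3 not found; run module load fhPython first"
fi
echo "$shebang_output"
```

Note that the file contains only Python code after the shebang: the shebang replaces typing python3 in front of the script name.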

2.9 Getting Help

You may have heard about man pages. You can usually get help by using the man command:

man wc

This is the first part of the output:

NAME
     wc – word, line, character, and byte count

SYNOPSIS
     wc [--libxo] [-Lclmw] [file ...]

DESCRIPTION
     The wc utility displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is
     specified) to the standard output.  A line is defined as a string of characters delimited by a ⟨newline⟩ character.
     Characters beyond the final ⟨newline⟩ character will not be included in the line count.

     A word is defined as a string of characters delimited by white space characters.  White space characters are the set of
     characters for which the iswspace(3) function returns true.  If more than one input file is specified, a line of cumulative
     counts for all the files is displayed on a separate line after the output for the last file.

     The following options are available:

         --libxo
             Generate output via libxo(3) in a selection of different human and machine readable formats.  See xo_parse_args(3)
             for details on command line arguments.

     -L      Write the length of the line containing the most bytes (default) or characters (when -m is provided) to standard
             output.  When more than one file argument is specified, the longest input line of all files is reported as the value
             of the final “total”.

I personally find man pages very hard to read, especially when there are lots of options for a command.

Instead, I use tldr, which contains examples of the most commonly used options for a command. It is not installed on gizmo, but you can use the page at https://tldr.inbrowser.app/, which has all of the tldr help pages.
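Whichever help source you use, it often helps to try an option on a tiny input where you already know the answer. The sketch below does this for two standard wc options, -l (lines) and -w (words); the sample text is made up.

```shell
# Two lines, three words, fed to wc on standard input.
# tr -d ' ' strips the padding that some versions of wc add.
line_count=$(printf 'one two\nthree\n' | wc -l | tr -d ' ')
word_count=$(printf 'one two\nthree\n' | wc -w | tr -d ' ')

echo "lines: $line_count, words: $word_count"
```

Once the small example behaves as the help text says, you can be more confident using the same option on real data.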

2.10 Recap

This week, we learned how to:

  • Navigate and copy data to the different filesystems available at Fred Hutch.
  • Set permissions on and execute a bash script.
  • Find help on the system.