19 Activity

19.1 Launching Galaxy on AnVIL

Note that, in order to use Galaxy, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace).

Open your Workspace, and click on the “Environment configuration” button, a cloud icon on the right-hand side of the screen.

Screenshot of the Workspace that points to the Environment configuration button, an icon of a cloud with a lightning bolt.

Under Galaxy, click on “Create new Environment”. Click on “Next” and then “Create” to keep all settings as-is. The environment will take 8-10 minutes to start.

The button that starts a cloud environment for Galaxy has been highlighted.

Click on “Open Galaxy” when the environment is ready.

The Open Galaxy button is highlighted in the ready environment popup.

19.2 Importing Data into Galaxy on AnVIL

When we cloned our workspace, the clone stayed linked to the original data. We will upload three files from the AnVIL workspace into Galaxy, though we need only one fastq sequence file for this activity. The other two will be used if you want to continue with a related activity that performs alignment and variant discovery after quality control. The three files are (1) the forward reads and (2) the reverse reads for our sample, and (3) the reference genome for SARS-CoV-2. There are two sets of reads for our sample because the scientists who collected it used paired-end sequencing. The sample files end in .fastq because they are raw data from the sequencer. The reference genome ends in .fasta because it has already been cleaned up by scientists.

  1. Click on “Upload Data” in the Tools pane.

Screenshot of the Galaxy homepage. The Upload Data link has been highlighted.

  2. Click on “Choose remote files” at the bottom of the popup.

Screenshot of the Galaxy Data upload popup page and upload options.

If you had files locally on your computer that you wanted to upload, you would use the “Choose local file” button.

Or if you had files you wanted to import from a data repository like Zenodo, you would use the “Paste/Fetch data” button.

We’re using the “Choose remote files” button because we have data in our AnVIL workspace that we can import into the Galaxy on AnVIL instance.

  3. Double-click the Workspace folder.

Screenshot of the Galaxy Data upload popup pane, highlighting the AnVIL workspace where your data is linked

  4. Upload the sample sequence data files.
  • Double-click “Tables/”

Screenshot of the Galaxy Data upload popup pane, highlighting the Tables/ folder where linked data will be in a cloned workspace

  • Then double-click “sample/”.

Screenshot of the Galaxy Data upload popup pane, highlighting the sample/ folder where the data sequence .fastq files will be found

  • Click the two sample .fastq file checkboxes to select them.

Screenshot of the Galaxy Data upload popup pane, selecting the two data sequence .fastq files

  • These files will be highlighted in green when ready. Click “Ok”.

Screenshot of the Galaxy Data upload popup pane, highlighting the selected data sequence .fastq files and the Ok button.

Steps 5-6: Upload the reference genome
  5. Repeat steps 2 and 3 from above.
  • Click on “Choose remote files” at the bottom of the popup.

Screenshot of the Galaxy Data upload popup page and upload options.

  • Double-click the Workspace folder.

Screenshot of the Galaxy Data upload popup pane, highlighting the AnVIL workspace where your data is linked

  6. Upload the reference genome file.
  • Again, double-click “Tables/”.

Screenshot of the Galaxy Data upload popup pane, highlighting the Tables/ folder where linked data will be in a cloned workspace

  • This time, double-click “reference/”.

Screenshot of the Galaxy Data upload popup pane, highlighting the reference/ folder where the reference genome fasta file will be found.

  • Click the .fasta file to select it.

Screenshot of the Galaxy Data upload popup pane, highlighting the reference genome .fasta file

  • This file will be highlighted in green when ready. Click “Ok”.

Screenshot of the Galaxy Data upload popup pane, highlighting the selected reference genome .fasta file and the Ok button.


  7. Click “Start”.

Screenshot of the Galaxy Data upload popup page. All three files are ready to be imported and the Start button is highlighted.

  8. Once complete, click “Close”.

Screenshot of the Galaxy Data upload popup pane. All three files are highlighted green and the Close button is highlighted.

  9. Confirm that the files uploaded successfully by looking at the file names in the Galaxy History pane.

Note that the files will be highlighted in green in the Galaxy History pane once they are uploaded and available.

Screenshot of the Galaxy homepage. The successfully uploaded files are boxed in green color.

19.3 Examining fastq sequence data files

We will examine data in fastq format. This is the typical output of an Illumina sequencer and the standard output format of most other sequencers.

  1. Click the eye icon of the first fastq file (VA_sample_forward_reads.fastq).

Screenshot of the Galaxy homepage. Highlighting the eye icon for the forward reads .fastq file.

  2. After clicking the eye icon, you will see something like this in the main screen:

Screenshot of a fastq file in the middle panel of Galaxy. The data includes DNA sequences but also includes many coded characters, making it hard to understand.

FASTQ files explained

For more information on the contents of a FASTQ file, consider this resource from Illumina.
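A FASTQ file stores each read as a group of four lines: a header starting with "@", the sequence, a separator starting with "+", and a quality string with one character per base. As a minimal sketch, assuming made-up reads and a hypothetical helper named `read_fastq` (neither comes from the workspace data), the grouping could be done in Python like this:

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples, four lines per read."""
    it = iter(lines)
    while True:
        # Take the next four lines; an empty result means the input is finished.
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("incomplete FASTQ record")
        yield tuple(record)


# Two made-up reads, for illustration only.
example = """@read_1
GATTACA
+
IIIIII+
@read_2
ACGT
+
II+I
""".splitlines()

records = list(read_fastq(example))
for header, sequence, separator, quality in records:
    print(header, sequence, quality)
```

Note that the quality string always has the same length as the sequence, because each base gets exactly one quality character.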

QUESTIONS:

  1. How many lines in a .fastq file represent an individual read?

  2. What does each line represent?

  3. Why is the final line for each read (the quality score) important?

Breakout Box: Learn more about quality scores

To save space, the sequencer records each score from 0 to 42 as a single ASCII character. For example, 10 corresponds to “+” and 40 corresponds to “I”. FastQC (a tool we’ll be using next) knows how to translate these characters back into scores. This way of encoding the data is often called “Phred” scoring.

What does 0-42 represent? These numbers, when plugged into a formula, tell us the probability of an error for that base. This is the formula, where Q is our quality score (0-42) and P is the probability of an error:

Q = -10 log10(P)

Using this formula, we can calculate that a quality score of 40 means only a 0.0001 probability of an error (1 in 10,000)!
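The translation from character to score to error probability can be sketched in a few lines of Python. This assumes the common "Phred+33" convention, in which the score is the character's ASCII code minus 33; that offset is an assumption here, although it agrees with the “+” and “I” examples above. The function names are hypothetical.

```python
def phred_score(char, offset=33):
    """Convert one quality character to its numeric score (assumes Phred+33)."""
    return ord(char) - offset


def error_probability(q):
    """Rearrange Q = -10 log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for char in "+I":
    q = phred_score(char)
    print(char, q, error_probability(q))
```

Rearranging the formula this way shows that every 10-point increase in Q makes an error ten times less likely.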

19.4 Finding and Using FastQC

FastQC is a tool which aims to provide simple quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

  1. Use the tool search bar in the upper left (within the Tools pane in Galaxy).

Screenshot of Galaxy on AnVIL with an arrow pointing to the tool search bar in the upper left corner. The tools icon is also highlighted on the far left in case you have to first navigate there to see the search bar.

  2. Type “fast” to search for FastQC, and select the tool in the list below.

Highlighting a search for part of the FastQC tool's name in the search bar and then selecting FastQC from the list below.

  3. This will open the tool menu in the middle pane.

Screenshot of Galaxy on AnVIL showing the FastQC tool menu has been opened in the middle pane.

  4. Switch the version of FastQC to 0.73+galaxy0.
  • Select the Versions icon (3 cubes).

Screenshot of Galaxy on AnVIL highlighting the versions icon in the gray banner at the top of the middle pane

  • Select “Switch to 0.73+galaxy0” from the dropdown menu.

Screenshot of Galaxy on AnVIL highlighting which version of FastQC to switch to in the dropdown menu for tool versions.

  • Confirm that the version now says “Selected 0.73+galaxy0”.

Screenshot of Galaxy on AnVIL highlighting what the version dropdown menu should display.

  5. Confirm or select the correct input for FastQC (the forward reads fastq file).

Screenshot of Galaxy on AnVIL highlighting the input for the tool, specifically selecting the forward reads fastq file for the 'Raw read data from your current history'

  6. Run FastQC by clicking the blue “Run Tool” button.

Screenshot of Galaxy on AnVIL highlighting the blue 'Run Tool' button in the top right corner of the middle pane.

  7. After submitting the job to run, the middle pane should have a message highlighted in green.

Screenshot of Galaxy on AnVIL showing how the middle pane and history pane look after submitting a job before it successfully runs.

The history pane should also list what will be the output(s) from the tool. Before the job starts running, these output(s) will be highlighted in gray. While the job is running, they will be highlighted in an orange cream color. Once the tool has run successfully, they will be highlighted in green.

Screenshot of Galaxy on AnVIL showing the outputs highlighted in green within the history pane.

19.5 Examining the FastQC quality control summary report

We will examine the FastQC output in webpage (HTML) format. This form of the output provides graphs and a flag of “Passed”, “Warn”, or “Fail” for each subsection within the quality control analysis.

  1. Click the eye icon of the FastQC Webpage output (FastQC on data 1: Webpage).

Screenshot of the Galaxy homepage. Highlighting the eye icon for the forward reads .fastq file.

  2. This will open a summary report for the sequencing file in the middle pane, which you can scroll through.

Screenshot of an example summary report from FastQC

FastQC summary report explained

For more information on the contents of the output quality control summary report from FastQC, consider this resource from Michigan State.

QUESTIONS:

  1. Explore “Basic Statistics”. How many total reads are there? Have any been flagged as poor quality? What is the sequence length?

  2. Explore “Per base sequence quality”. Based on the Basic Statistics, is 28-40 a good or bad quality score?

  3. Is it okay to proceed based on the per base sequence quality?
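To get a feel for where numbers like these come from, the Python sketch below computes a read count, a length range, and a mean quality score per position for a handful of made-up reads. It is only an illustration under stated assumptions: the function names are hypothetical, the quality strings are assumed to use the Phred+33 encoding, and this is not how FastQC itself is implemented. FastQC's own report summarises each position in more detail than a single mean.

```python
from statistics import mean


def basic_statistics(reads):
    """Summarise a list of (sequence, quality_string) pairs."""
    lengths = [len(seq) for seq, _ in reads]
    return {
        "total_reads": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


def mean_quality_per_position(reads, offset=33):
    """Average the quality score at each base position (assumes Phred+33)."""
    longest = max(len(qual) for _, qual in reads)
    means = []
    for i in range(longest):
        # Only reads long enough to have a base at position i contribute.
        scores = [ord(qual[i]) - offset for _, qual in reads if i < len(qual)]
        means.append(mean(scores))
    return means


# Two made-up reads, for illustration only.
toy_reads = [("ACGT", "IIII"), ("ACGA", "II+I")]
stats = basic_statistics(toy_reads)
per_position = mean_quality_per_position(toy_reads)
print(stats)
print(per_position)
```

In this toy example the third position has a lower average because one of the two reads has a low-quality base there, which is the kind of dip the per base sequence quality graph makes visible.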

19.6 Exporting your results

In case you want to view the results later, you can download the file.

  1. Click on the name of the results you want to export/save, and it will expand the info shown for that file. Click the floppy disk/save icon.

Screenshot of Galaxy on AnVIL highlighting the floppy disk/save icon for the results that you want to export.

This will download a zip (compressed) file with the results to your computer. Uncompress it to view it locally.
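If you prefer to uncompress the download with a script rather than your file manager, a minimal Python sketch using the standard-library `zipfile` module could look like the following. The file and folder names are hypothetical; use whatever name your browser gave the download.

```python
import zipfile
from pathlib import Path


def extract_results(zip_path, destination):
    """Extract a downloaded results archive and return the names it contained."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return archive.namelist()


# Hypothetical file name; replace it with the name of your downloaded file.
download = Path("fastqc_results.zip")
if download.exists():
    for name in extract_results(download, "fastqc_results"):
        print(name)
```

The extracted folder can then be opened locally to view the report.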

19.7 Shutting down Galaxy on AnVIL

Once you are done with your activity, you’ll need to shut down your Galaxy cloud environment. This frees up the cloud resources for others and minimizes computing cost. The following steps will delete your work, so make sure you are completely finished at this point. Otherwise, you will have to repeat your work from the previous steps.

Return to AnVIL, and find the Galaxy logo that shows your cloud environment is running. Click on this logo.

Screenshot of the Workspace menu. The currently running Galaxy cloud environment logo on the right sidebar is highlighted.

Next, click on “Settings”. Click on “Delete Environment”.

Screenshot of the cloud environment pop out menu. The "Delete Environment" button is highlighted.

Finally, select “Delete everything, including persistent disk”. Make sure you are done with the activity and then click “Delete”.

Screenshot of the cloud environment pop out menu. The “Delete everything, including persistent disk” radio button has been checked and is highlighted. The “Delete” button is highlighted.