
19 Exercises
19.1 Explore MAGE Workspace
The demos-explore-mage Workspace allows you to explore how data is organized on AnVIL using a subset of data from the Multi-ancestry Analysis of Gene Expression (MAGE) and 1000 Genomes Project (1KGP) projects. Pay close attention to the relationship between file links in Data Tables and the actual files in Workspace Buckets, especially after cloning a Workspace. You can navigate to the MAGE Workspace directly through this link: https://anvil.terra.bio/#workspaces/anvil-outreach/demos-explore-mage
Key components of this Workspace are:
- DASHBOARD tab - which will tell you more about the Workspace.
- DATA tab - which has Data Tables with links to data relevant for demo exercises.
- ANALYSES tab - which has notebooks relevant for demo exercises with Jupyter and RStudio.
- ‘Environment Configuration’ button - to select the Cloud Environment of choice.
- ‘Browse workspace files’ button - which contains any files stored in this Workspace (see Review Key Concepts for important note about Cloned Workspaces) as well as notebooks listed in the ANALYSES tab.
19.1.2 Scavenger Hunt
In this exercise you will go on a scavenger hunt to explore the Original Workspace and your Cloned Workspace to look for and understand the differences between them.
The focus of this scavenger hunt within the Workspaces will be on:
- Entries in Data Tables
- Files in Workspace Buckets
Original Workspace
- https://anvil.terra.bio/#workspaces/anvil-outreach/demos-explore-mage
- Explore Data Tables e.g. counts.csv, PCA.nb.html
- Find files in Workspace Bucket
Your Cloned Workspace
- https://anvil.terra.bio/#workspaces/<billing-project>/<workspace-name>
- Explore Data Tables
- Look for files in Workspace Bucket
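One way to reason about the scavenger hunt is to compare the bucket named in a Data Table link with the bucket of the Workspace you are browsing. The sketch below is illustrative only: the bucket names and file path are hypothetical placeholders, and `bucket_of` and `is_in_workspace_bucket` are helper names invented here, not part of AnVIL or Terra.

```python
from urllib.parse import urlparse


def bucket_of(link: str) -> str:
    """Return the bucket name from a gs:// link such as gs://bucket/path/file."""
    parsed = urlparse(link)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// link: {link!r}")
    return parsed.netloc


def is_in_workspace_bucket(link: str, workspace_bucket: str) -> bool:
    """True when the linked file lives in the given Workspace Bucket."""
    return bucket_of(link) == workspace_bucket


# Hypothetical values: substitute a link copied from a Data Table and the
# bucket name of your own Workspace.
table_link = "gs://fc-original-bucket/data/counts.csv"
cloned_bucket = "fc-cloned-bucket"

print(bucket_of(table_link))
print(is_in_workspace_bucket(table_link, cloned_bucket))
```

Running the same check against both the Original Workspace and your Cloned Workspace is one way to record what you find for each Data Table entry.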
19.2 Analysis with Jupyter/Terminal
The Jupyter Cloud Environment allows for both interactive analysis using Jupyter Notebooks as well as a UNIX Terminal. Pre-configured with conda, Python, R, and GATK, you can further personalize your environment using startup scripts and custom Docker images. In this Workspace you will find a Jupyter Notebook demonstrating how to generate a quick summary of 1000 Genomes variants. In this exercise you will learn how to:
- Launch Jupyter Cloud Environment in your Cloned Workspace.
- Access a UNIX Terminal.
- Transfer data between your Workspace Bucket and Persistent Disk using gcloud storage.
- Run a Jupyter Notebook to plot a histogram of 1000 Genomes variants per chromosome.
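The per-chromosome summary in the last step can be sketched in a few lines of Python. This is not the notebook's own code; it is a minimal illustration that assumes tab-separated VCF text, and the excerpt in `VCF_TEXT` is made up for the example.

```python
from collections import Counter

# Hypothetical VCF excerpt (tab-separated); real 1000 Genomes files are far larger.
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t10177\t.\tA\tAC",
    "chr1\t10352\t.\tT\tTA",
    "chr2\t10133\t.\tC\tT",
    "chr1\t11008\t.\tC\tG",
])


def variants_per_chromosome(vcf_text: str) -> Counter:
    """Count data lines per chromosome, skipping header lines that start with '#'."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]  # first column is CHROM
        counts[chrom] += 1
    return counts


counts = variants_per_chromosome(VCF_TEXT)

# Text histogram: one '#' per variant, chromosomes in sorted order.
for chrom, n in sorted(counts.items()):
    print(f"{chrom:>5} {'#' * n} ({n})")
```

In a notebook the same counts could be passed to a plotting library instead of being printed as text.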
19.3 Bioconductor with RStudio
The next Cloud Environment we will explore is RStudio which comes pre-configured with support for both the recent and prior release of Bioconductor. In 2024, the Bioconductor project provided almost 4,000 packages covering a “broad range of powerful statistical and graphical methods for the analysis of genomic data”. AnVIL enables analysis using Bioconductor in a single secure environment with direct access to results from additional tools like Galaxy and WDL Workflows. In this exercise you will learn how to:
- Launch RStudio Cloud Environment in your Cloned Workspace.
- Transfer data between your Workspace and Persistent Disk using the Bioconductor AnVILGCP package.
- Run an R Notebook to make a PCA plot of MAGE RNA expression counts.
19.3.3 Use AnVILGCP
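Copy a file to your Persistent Disk using the link in the Data Table and the Bioconductor AnVILGCP package:
BiocManager::install( "AnVILGCP" )  # install the package from Bioconductor
library( "AnVILGCP" )  # load it into the session
avtable( "MAGE" )  # display the MAGE Data Table
gcloud_storage( paste( "cp", avtable( "MAGE" )[2,2], "." ) )  # copy the file linked in row 2, column 2 to the current directory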
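For readers working from the Jupyter environment instead, a rough Python analogue of the copy step might look like the sketch below. It assumes the `gcloud` command-line tool is on the PATH; the link is a hypothetical placeholder, and `build_copy_command` and `copy_file` are names invented for this example.

```python
import shutil
import subprocess
from typing import List


def build_copy_command(source: str, destination: str = ".") -> List[str]:
    """Build the argument list for `gcloud storage cp SOURCE DESTINATION`."""
    return ["gcloud", "storage", "cp", source, destination]


def copy_file(source: str, destination: str = ".") -> bool:
    """Run the copy if the gcloud CLI is available; return False when it is not."""
    if shutil.which("gcloud") is None:
        return False
    subprocess.run(build_copy_command(source, destination), check=True)
    return True


# Hypothetical link; in practice this would be a value copied from a Data Table.
link = "gs://fc-example-bucket/counts.csv"
print(" ".join(build_copy_command(link)))
```

The command is built as a list so that the link is passed as a single argument without shell quoting; `copy_file` is defined but deliberately not called here.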
19.3.4 Run R Notebook
Two notebooks are available showcasing notable features of Bioconductor and RStudio on AnVIL:
- PCA.Rmd – Make a PCA plot of gene expression counts from MAGE, demonstrating how AnVILGCP facilitates access to Data Tables (avtable()) and Workspace Buckets (gcloud_storage()).
- eQTL.Rmd (requires using the install-conda.sh startup script) – Visualize an eQTL (expression quantitative trait locus) correlated with expression of GSTP1, combining datasets from the MAGE and 1KGP Workspaces.
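To make the eQTL idea concrete, the sketch below computes a Pearson correlation between genotype dosage and expression. It is not taken from eQTL.Rmd: the six samples are invented for illustration and are not MAGE or 1KGP measurements, and a real analysis would use a proper statistical model.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: alternate-allele dosage (0, 1, or 2) and expression
# values for six samples.
dosage = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

r = pearson(dosage, expression)
print(f"r = {r:.3f}")
```

A value of `r` near 1 or -1 indicates that expression changes steadily with the number of alternate alleles, which is the pattern an eQTL plot makes visible.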
19.4 Workflows with Galaxy
Galaxy is a free, open-source, web-based platform for data-intensive biomedical research. Through its graphical user interface, over 10,000 tools are ready to run without the need for software installation or prior coding experience. In addition to providing a secure space for analysis and data sharing, Galaxy on AnVIL automatically grants you system administrator privileges, enabling you to install and, soon, configure any tool in the toolshed.
In this exercise you will learn how to:
- Launch Galaxy Environment in your Cloned Workspace.
- Upload and export data from/to your Workspace Bucket.
- Run a tool.
- Install a tool.
19.4.3 Run FastQC
Copy a file to your Persistent Disk using the Galaxy Upload tool:
- Click “Upload” in the left hand menu
- Click “Choose remote files” at the bottom
- Select the demos-explore-mage Workspace
- Select Tables and then 1KGP
- Tick the box next to NA12878.1mil.fastq.gz and click “Ok”
- Click “Start” and then “Close”
You can now:
- Run the FastQC tool and view FastQC on data 1: Webpage
- Export data back to AnVIL using “Export History to File” (must target Other Data/Files)
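To give a sense of what a FASTQ quality report summarizes, the sketch below parses FASTQ text and reports per-read mean quality and GC fraction. It is not FastQC and does not reproduce its report; it assumes four-line records with Phred+33 quality encoding, and the two reads in `FASTQ_TEXT` are invented for the example.

```python
from typing import Iterator, Tuple

# Hypothetical two-read FASTQ excerpt; real files hold many more reads.
FASTQ_TEXT = "\n".join([
    "@read1",
    "ACGT",
    "+",
    "IIII",
    "@read2",
    "GGCC",
    "+",
    "!!II",
])


def parse_fastq(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 3, 4):
        name, seq, plus, qual = lines[i:i + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield name[1:], seq, qual


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


for name, seq, qual in parse_fastq(FASTQ_TEXT):
    print(name, f"mean_q={mean_phred(qual):.1f}", f"gc={gc_fraction(seq):.2f}")
```

For a compressed file such as NA12878.1mil.fastq.gz, the text could be read with Python's `gzip.open(path, "rt")` before parsing.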