23 Exercises
This hands-on activity will help you explore MAGE, an open-access RNA sequencing dataset of lymphoblastoid cell lines from 731 individuals from the 1000 Genomes Project. As part of this exploration, we will attempt to recreate various figures from this paper.
The GitHub repository can be found here.
Processed data can be found in Zenodo here or in Dropbox here.
To follow along with these exercises, you will need to complete the steps described in the Preparation guide for this demo.
Reminder: The following steps are found in the detailed notebook Reproducibility_in_Action.Rmd in your workspace. To set up your workspace, read through the steps here.
23.0.1 Environment Setup
We will load the three required R packages (tidyverse, vcfR, AnVILGCP) and introduce how R packages work, including the distinction between installation and loading.
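The difference between installing and loading can be sketched as follows. This is a minimal sketch, not the notebook's own code: it assumes tidyverse and vcfR are installed from CRAN and AnVILGCP from Bioconductor via BiocManager, and that a package repository is already configured in your R session.

```r
# Installation copies a package onto disk; it is needed once per environment.
cran_pkgs <- c("tidyverse", "vcfR")
is_installed <- vapply(cran_pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!is_installed)) install.packages(cran_pkgs[!is_installed])

# Assumption: AnVILGCP is obtained from Bioconductor through BiocManager.
if (!requireNamespace("AnVILGCP", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  BiocManager::install("AnVILGCP")
}

# Loading attaches an installed package to the current session;
# it must be repeated every time R is restarted.
library(tidyverse)
library(vcfR)
library(AnVILGCP)
```

If the packages are already installed in your workspace, only the three `library()` calls do any work.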
23.0.2 Recreating Figure 5C: GSTP1 Expression by Genotype
- Importing Expression Counts
We will copy a pre-loaded expression counts CSV from the workspace bucket using AnVILGCP functions, then read it into R.
- Reshaping Expression Data
The wide-format counts matrix is reshaped with tidyverse operations so that each row is a sample and each column is a gene.
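One way to perform this kind of reshape is a `pivot_longer()` followed by a `pivot_wider()`. The sketch below uses a made-up three-sample table; the column names `gene_id`, `sample_id`, and `count` are assumptions and may differ from those in the notebook.

```r
library(tidyverse)

# Toy wide-format counts: one row per gene, one column per sample.
# All identifiers and values are invented for illustration.
counts_wide <- tibble(
  gene_id  = c("geneA", "geneB"),
  sample_1 = c(10, 0),
  sample_2 = c(25, 3),
  sample_3 = c(40, 7)
)

counts_by_sample <- counts_wide %>%
  # Stack the sample columns into long format (one row per gene-sample pair).
  pivot_longer(cols = -gene_id, names_to = "sample_id", values_to = "count") %>%
  # Spread the genes back out so that each gene becomes its own column.
  pivot_wider(names_from = gene_id, values_from = count)

counts_by_sample
```

The result has one row per sample, which is the shape needed to join against per-sample metadata.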
- Importing Metadata
Sample metadata is fetched from the bucket and read into R, providing population labels and other per-sample information.
- Joining Counts and Metadata
The reshaped counts table and metadata are merged into a single object using an inner join on shared sample identifiers.
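As a small illustration of what an inner join keeps and drops, consider the toy tables below. The key column name `sample_id` and the metadata columns are assumptions; the sample identifiers and values are invented.

```r
library(tidyverse)

# Toy inputs; sample_3 lacks metadata and sample_4 lacks counts.
counts_by_sample <- tibble(
  sample_id = c("sample_1", "sample_2", "sample_3"),
  geneA     = c(10, 25, 40)
)
metadata <- tibble(
  sample_id  = c("sample_1", "sample_2", "sample_4"),
  population = c("PEL", "YRI", "CEU"),
  sex        = c("female", "male", "female")
)

# inner_join() keeps only the samples present in both tables.
joined <- inner_join(counts_by_sample, metadata, by = "sample_id")

# anti_join() lists the samples that the inner join dropped from the counts.
dropped <- anti_join(counts_by_sample, metadata, by = "sample_id")
```

Checking the dropped rows is a quick way to catch mismatched identifiers before plotting.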
- Importing Reference Annotations
The GENCODE v48 GTF annotation file is downloaded and parsed to identify the Ensembl ID for the gene of interest (GSTP1).
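The lookup can be sketched on a few GTF-style lines. This is an illustration only: the records, coordinates, and the placeholder identifiers (`ENSG_TOY_GSTP1.1` and so on) are invented and are not real GENCODE entries. The nine column names follow the standard GTF layout.

```r
library(tidyverse)

# Toy GTF records (tab-separated, nine columns); every value is a placeholder.
gtf_lines <- c(
  "##description: toy annotation for illustration only",
  paste("chr11", "TOY", "gene", "100", "500", ".", "+", ".",
        'gene_id "ENSG_TOY_GSTP1.1"; gene_type "protein_coding"; gene_name "GSTP1";',
        sep = "\t"),
  paste("chr11", "TOY", "transcript", "100", "500", ".", "+", ".",
        'gene_id "ENSG_TOY_GSTP1.1"; transcript_id "ENST_TOY.1"; gene_name "GSTP1";',
        sep = "\t"),
  paste("chr11", "TOY", "gene", "900", "1500", ".", "-", ".",
        'gene_id "ENSG_TOY_OTHER.2"; gene_type "lncRNA"; gene_name "OTHER1";',
        sep = "\t")
)

# Header lines start with '#'; quoting is disabled because the attribute
# column contains literal double quotes.
gtf <- read.delim(
  text = gtf_lines, header = FALSE, sep = "\t", quote = "",
  comment.char = "#", stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)

gstp1 <- gtf %>%
  filter(feature == "gene") %>%
  mutate(
    # Pull individual key-value pairs out of the attribute string.
    gene_id   = str_match(attribute, 'gene_id "([^"]+)"')[, 2],
    gene_name = str_match(attribute, 'gene_name "([^"]+)"')[, 2]
  ) %>%
  filter(gene_name == "GSTP1")

gstp1_id <- gstp1$gene_id
# Drop the version suffix, in case the counts table uses unversioned IDs.
gstp1_id_unversioned <- str_remove(gstp1_id, "\\.\\d+$")
```

Whether the expression table uses versioned or unversioned Ensembl IDs is worth checking before filtering on the result.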
- Extracting Variant Data
We will retrieve chromosome 11 variant calls from the 1000 Genomes high-coverage dataset, index the VCF, and use bcftools to subset the file to the specific SNP of interest (rs115070172, position chr11:67,559,635). The VCF is then read into R using vcfR.
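The final step, reading a subsetted VCF into R, might look like the sketch below. A tiny hand-written VCF stands in for the bcftools output so that the example runs on its own; the alleles, the second variant, the sample names, and the genotypes are all invented. Only the SNP ID and position come from the text above.

```r
library(tidyverse)
library(vcfR)

# Toy VCF standing in for the subsetted file; REF/ALT and genotypes are made up.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##contig=<ID=chr11>",
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
  paste("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
        "sample_1", "sample_2", "sample_3", sep = "\t"),
  paste("chr11", "67559635", "rs115070172", "A", "G", ".", "PASS", ".", "GT",
        "0|0", "0|1", "1|0", sep = "\t"),
  paste("chr11", "67559700", "toy_variant", "C", "T", ".", "PASS", ".", "GT",
        "0|0", "0|0", "1|1", sep = "\t")
)
vcf_path <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_path)

vcf <- read.vcfR(vcf_path, verbose = FALSE)

# extract.gt() returns a matrix with variants as rows and samples as columns.
gt <- extract.gt(vcf, element = "GT")

# Locate the SNP of interest by its ID in the fixed fields.
snp_id  <- "rs115070172"
snp_row <- which(vcf@fix[, "ID"] == snp_id)

genotypes <- tibble(
  sample_id = colnames(gt),
  genotype  = unname(gt[snp_row, ])
)
```

The resulting table has one row per sample, ready to be joined to the expression data by sample identifier.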
- Combining Variants and Expression
Genotype calls are joined to GSTP1 expression data, phased alleles are collapsed into unphased genotype groups, and per-genotype sample counts are computed and appended as plot labels.
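Collapsing phased calls means treating `0|1` and `1|0` as the same heterozygous group. A minimal sketch on invented data follows; the column names and the label format are assumptions rather than the notebook's exact choices.

```r
library(tidyverse)

# Toy joined table of phased genotypes and expression counts (values made up).
expr_gt <- tibble(
  sample_id   = paste0("sample_", 1:6),
  genotype    = c("0|0", "0|1", "1|0", "0|0", "1|1", "0|1"),
  gstp1_count = c(120, 80, 75, 130, 30, 90)
)

plot_data <- expr_gt %>%
  mutate(
    # Phase is irrelevant here, so both heterozygous orders map to "0/1".
    # Any other value (for example a missing call) becomes NA.
    genotype_unphased = case_when(
      genotype == "0|0"             ~ "0/0",
      genotype %in% c("0|1", "1|0") ~ "0/1",
      genotype == "1|1"             ~ "1/1"
    )
  ) %>%
  # Attach the number of samples in each genotype group to every row.
  add_count(genotype_unphased, name = "n_samples") %>%
  # Build an axis label such as "0/1" with "(n = 3)" on a second line.
  mutate(genotype_label = paste0(genotype_unphased, "\n(n = ", n_samples, ")"))
```

Using the label column on the x-axis places the sample size directly under each genotype in the plot.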
- Visualization
A combined violin and boxplot is produced using ggplot2, displaying log2-normalized GSTP1 counts stratified by rs115070172 genotype, closely matching the published figure.
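A plot of this kind can be sketched by layering `geom_boxplot()` over `geom_violin()`. The data below are simulated, and the `log2(count + 1)` transform with a pseudocount of 1 is an assumption; the notebook may normalize differently.

```r
library(tidyverse)

# Simulated counts for three genotype groups; not real MAGE data.
set.seed(1)
plot_data <- tibble(
  genotype_label = rep(c("0/0\n(n = 20)", "0/1\n(n = 20)", "1/1\n(n = 20)"),
                       each = 20),
  gstp1_count    = c(rpois(20, 400), rpois(20, 250), rpois(20, 100))
)

p <- ggplot(plot_data, aes(x = genotype_label, y = log2(gstp1_count + 1))) +
  # The violin shows the full distribution within each genotype group.
  geom_violin(aes(fill = genotype_label), alpha = 0.5, show.legend = FALSE) +
  # A narrow boxplot on top marks the median and quartiles.
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = "rs115070172 genotype", y = "log2(GSTP1 count + 1)") +
  theme_classic()

p
```

Drawing the boxplot after the violin keeps it visible on top; reversing the layer order would hide it behind the fill.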
23.0.3 Recreating Figure 5D: GSTP1 Expression by Population
Building on the joined dataset, we will use a conditional mutate() step to create a new column distinguishing Peruvian (PEL) samples from non-PEL samples, then produce an analogous violin/boxplot stratified by population label.
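The new column can be created with `mutate()` and `if_else()`, as in this sketch. The column names `population` and `pel_group` are assumptions, and the rows are invented.

```r
library(tidyverse)

# Toy joined table; population codes follow 1000 Genomes conventions.
joined <- tibble(
  sample_id   = paste0("sample_", 1:4),
  population  = c("PEL", "YRI", "PEL", "CEU"),
  gstp1_count = c(60, 140, 55, 150)
)

joined <- joined %>%
  # Label each sample as PEL or non-PEL based on its population code.
  mutate(pel_group = if_else(population == "PEL", "PEL", "non-PEL"))

# Quick check of how many samples fall into each group.
group_sizes <- count(joined, pel_group)
```

The `pel_group` column can then replace the genotype label on the x-axis of the violin/boxplot.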
23.0.4 Independent Extension: A Second eQTL
You will independently apply the full workflow to a second eQTL (rs7927381 × GSTP1), locating the SNP, subsetting the VCF, joining with expression data, and generating a comparable plot.
23.0.5 Additional Exploration
Several optional prompts are provided, including faceting plots by population or continental group, visualizing genotype frequency distributions, comparing GSTP1 expression by sex, and replicating Figure 1B using principal component data.
At the end of the exercise, remember to shut down your compute! Refer to the steps here.


