23 Exercises

This hands-on activity explores MAGE, an open-access RNA sequencing dataset of lymphoblastoid cell lines derived from 731 individuals in the 1000 Genomes Project. As part of this exploration, we will attempt to recreate several figures from this paper.

The GitHub repository can be found here.

Processed data can be found in Zenodo here or in Dropbox here.

To follow along with these exercises, you will need to complete the steps described in the Preparation guide for this demo.

Reminder: The following steps are found in the detailed notebook Reproducibility_in_Action.Rmd in your workspace. To set up your workspace, read through the steps here.

23.0.1 Environment Setup

We will load the three required R packages (tidyverse, vcfR, AnVILGCP) and introduce how R packages work, including the distinction between installation and loading.
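As a minimal sketch of the difference between installation and loading, the snippet below uses tidyverse only. The CRAN mirror URL is an assumption, and the guard around `install.packages()` is just one common pattern, not necessarily what the notebook does. Packages distributed through Bioconductor are typically installed with `BiocManager::install()` instead.

```r
# Installation: done once per environment. It downloads the package to disk.
# The guard skips the download when the package is already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Loading: done in every new R session. It attaches the package so that
# its functions can be called without a prefix.
library(tidyverse)

# An installed package can also be used without loading it,
# by prefixing the function with the package name and `::`.
upper_name <- stringr::str_to_upper("gstp1")
```

Installing places files on disk, while loading only affects the current session, which is why `library()` calls appear at the top of a notebook and `install.packages()` calls usually do not.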

23.0.2 Recreating Figure 5C: GSTP1 Expression by Genotype

  1. Importing Expression Counts

We will copy a pre-loaded expression counts CSV from the workspace bucket using AnVILGCP functions, then read it into R.

  1. Reshaping Expression Data

The wide-format counts matrix is transposed into tidy format using tidyverse operations, placing samples as rows and genes as columns.
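The reshaping step might look roughly like the sketch below. The toy table, its column names (`gene_id`, `S1`, `S2`, `S3`), and the gene labels are invented for illustration; the real counts file will have different identifiers and many more rows and columns.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy wide-format counts: one row per gene, one column per sample (made-up values).
counts_wide <- tribble(
  ~gene_id, ~S1, ~S2, ~S3,
  "GENE_A",  10,  20,  30,
  "GENE_B",   5,   0,  15
)

# Lengthen to one row per gene/sample pair, then widen again by gene,
# which leaves one row per sample and one column per gene.
counts_tidy <- counts_wide %>%
  pivot_longer(cols = -gene_id, names_to = "sample_id", values_to = "count") %>%
  pivot_wider(names_from = gene_id, values_from = count)
```

Going through the long format first keeps the sample identifiers as an ordinary column, which is what the later join on sample IDs needs.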

  1. Importing Metadata

Sample metadata is fetched from the bucket and read into R, providing population labels and other per-sample information.

  1. Joining Counts and Metadata

The reshaped counts table and metadata are merged into a single object using an inner join on shared sample identifiers.
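To illustrate what an inner join keeps and drops, here is a small sketch with made-up tables. The column names `sample_id` and `population` and all values are assumptions; the real metadata columns may be named differently.

```r
library(dplyr)
library(tibble)

# Toy per-sample expression values (made up).
counts_tidy <- tribble(
  ~sample_id, ~GENE_A,
  "S1",       10,
  "S2",       20,
  "S3",       30
)

# Toy metadata; note that S3 is missing and S4 has no expression data.
metadata <- tribble(
  ~sample_id, ~population,
  "S1",       "PEL",
  "S2",       "YRI",
  "S4",       "GBR"
)

# An inner join keeps only samples present in both tables.
counts_meta <- inner_join(counts_tidy, metadata, by = "sample_id")

# An anti join lists expression samples that found no metadata match,
# which is a quick way to check what the inner join discarded.
unmatched <- anti_join(counts_tidy, metadata, by = "sample_id")
```

Checking the unmatched rows after a join is a cheap safeguard against silently losing samples because of mismatched identifiers.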

  1. Importing Reference Annotations

The GENCODE v48 GTF annotation file is downloaded and parsed to identify the Ensembl ID for the gene of interest (GSTP1).
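One possible way to look up a gene ID in a GTF file is sketched below. It relies only on the general GTF layout (nine tab-separated columns, with key/value pairs in the last one). The file contents and the `ENSG_FAKE_...` identifiers are invented placeholders, not real GENCODE entries, and the notebook may parse the file differently.

```r
library(readr)
library(dplyr)
library(stringr)

# Write a tiny, made-up GTF so the example is self-contained.
gtf_lines <- c(
  '##description: toy annotation, not real GENCODE content',
  'chr11\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_FAKE_0001.1"; gene_name "GSTP1";',
  'chr11\tTOY\texon\t100\t300\t.\t+\t.\tgene_id "ENSG_FAKE_0001.1"; gene_name "GSTP1";',
  'chr11\tTOY\tgene\t2000\t2500\t.\t-\t.\tgene_id "ENSG_FAKE_0002.1"; gene_name "OTHER1";'
)
gtf_path <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_path)

# Read the nine standard GTF columns; header lines starting with "#" are skipped,
# and quoting is disabled because the attribute column contains literal quotes.
gtf <- read_tsv(
  gtf_path,
  comment = "#",
  quote = "",
  col_names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute"),
  col_types = "ccciicccc"
)

# Keep gene-level records, pull gene_id and gene_name out of the attribute
# string, and return the ID for the gene of interest.
gstp1_id <- gtf %>%
  filter(feature == "gene") %>%
  mutate(
    gene_id   = str_match(attribute, 'gene_id "([^"]+)"')[, 2],
    gene_name = str_match(attribute, 'gene_name "([^"]+)"')[, 2]
  ) %>%
  filter(gene_name == "GSTP1") %>%
  pull(gene_id)
```

Filtering to `feature == "gene"` first avoids returning the same ID once per exon or transcript record.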

  1. Extracting Variant Data

We will retrieve chromosome 11 variant calls from the 1000 Genomes high-coverage dataset, index the VCF, and use bcftools to subset the file to the specific SNP of interest (rs115070172, position chr11:67,559,635). The VCF is then read into R using vcfR.
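The final step, reading a single-variant VCF into R, could be sketched as follows. This covers only the vcfR part, not the download, indexing, or bcftools subsetting. The toy VCF is invented: the alleles and genotypes are placeholders, and the sample names `S1` to `S3` are hypothetical.

```r
library(vcfR)

# Write a tiny single-variant VCF so the example is self-contained.
# REF/ALT alleles and genotypes below are placeholders, not real calls.
header <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
            "FORMAT", "S1", "S2", "S3")
record <- c("chr11", "67559635", "rs115070172", "A", "G", ".", "PASS", ".",
            "GT", "0|0", "0|1", "1|1")
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##contig=<ID=chr11>",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  paste(header, collapse = "\t"),
  paste(record, collapse = "\t")
)
vcf_path <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_path)

# Read the VCF and extract the genotype (GT) matrix:
# one row per variant, one column per sample.
vcf <- read.vcfR(vcf_path, verbose = FALSE)
gt <- extract.gt(vcf, element = "GT")

# Convert the single row into a two-column table that can be joined by sample.
genotypes <- data.frame(
  sample_id = colnames(gt),
  gt = as.character(gt[1, ]),
  stringsAsFactors = FALSE
)
```

Turning the genotype matrix into a long table with a sample identifier column is what makes the later join with the expression data straightforward.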

  1. Combining Variants and Expression

Genotype calls are joined to GSTP1 expression data, phased alleles are collapsed into unphased genotype groups, and per-genotype sample counts are computed and appended as plot labels.
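The collapsing and labelling logic can be sketched as below, assuming a biallelic SNP with no missing calls. The tables, column names, and the `0/0`, `0/1`, `1/1` group names are illustrative choices; the notebook may label genotypes by their actual alleles instead.

```r
library(dplyr)
library(tibble)
library(stringr)

# Toy phased genotype calls and expression values (made up).
genotypes <- tribble(
  ~sample_id, ~gt,
  "S1", "0|0",
  "S2", "0|1",
  "S3", "1|0",
  "S4", "1|1",
  "S5", "0|0"
)
expression <- tribble(
  ~sample_id, ~log2_expr,
  "S1", 10.2,
  "S2",  9.1,
  "S3",  9.4,
  "S4",  8.0,
  "S5", 10.5
)

combined <- genotypes %>%
  inner_join(expression, by = "sample_id") %>%
  mutate(
    # Count alternate alleles, so that 0|1 and 1|0 fall into the same group.
    alt_count = str_count(gt, "1"),
    genotype = case_when(
      alt_count == 0 ~ "0/0",
      alt_count == 1 ~ "0/1",
      alt_count == 2 ~ "1/1"
    )
  ) %>%
  # Attach the number of samples in each genotype group to every row.
  add_count(genotype, name = "n_samples") %>%
  # Build an axis label such as "0/1" followed by "(n = 2)" on a new line.
  mutate(label = paste0(genotype, "\n(n = ", n_samples, ")"))
```

Counting alternate alleles rather than matching strings treats `0|1` and `1|0` identically, which is the point of discarding phase for this plot.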

  1. Visualization

A combined violin and boxplot is produced using ggplot2, displaying log2-normalized GSTP1 counts stratified by rs115070172 genotype, closely matching the published figure.
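A layered violin and boxplot of this kind might be built as in the sketch below. The data are simulated purely for illustration, and the styling choices (fill, widths, theme) are assumptions rather than the settings used for the published figure.

```r
library(ggplot2)

# Simulated log2 expression values for three genotype groups (not real data).
set.seed(1)
plot_data <- data.frame(
  genotype = rep(c("0/0", "0/1", "1/1"), each = 30),
  log2_expr = c(rnorm(30, mean = 10, sd = 0.5),
                rnorm(30, mean = 9,  sd = 0.5),
                rnorm(30, mean = 8,  sd = 0.5))
)

# The violin shows the full distribution; a narrow boxplot drawn on top
# marks the median and quartiles. Outlier points are hidden on the boxplot
# because the violin already conveys the tails.
p <- ggplot(plot_data, aes(x = genotype, y = log2_expr)) +
  geom_violin(aes(fill = genotype), alpha = 0.5) +
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  labs(x = "rs115070172 genotype",
       y = "GSTP1 expression (log2-normalized counts)") +
  theme_classic() +
  theme(legend.position = "none")

print(p)
```

Layer order matters here: adding `geom_boxplot()` after `geom_violin()` draws the box on top of the violin rather than underneath it.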

23.0.3 Recreating Figure 5D: GSTP1 Expression by Population

Building on the joined dataset, we will use a conditional mutate() call to create a new column that distinguishes Peruvian (PEL) samples from non-PEL samples, then produce an analogous violin/boxplot stratified by that population label.
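Creating the PEL versus non-PEL column could look like the following sketch. The table and the column names `population` and `pel_group` are hypothetical, and the non-PEL population codes are used only as toy values.

```r
library(dplyr)
library(tibble)

# Toy joined table with a population code per sample (made-up values).
counts_meta <- tribble(
  ~sample_id, ~population, ~log2_expr,
  "S1", "PEL", 8.1,
  "S2", "YRI", 10.3,
  "S3", "PEL", 8.4,
  "S4", "GBR", 10.0
)

# if_else() evaluates the condition row by row and assigns one of two labels.
counts_meta <- counts_meta %>%
  mutate(pel_group = if_else(population == "PEL", "PEL", "Non-PEL"))
```

The new `pel_group` column can then be mapped to the x axis in the same way genotype was in the previous plot.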

23.0.4 Independent Extension: A Second eQTL

You will independently apply the full workflow to a second eQTL (rs7927381 × GSTP1), locating the SNP, subsetting the VCF, joining with expression data, and generating a comparable plot.

23.0.5 Additional Exploration

Several optional prompts are provided, including faceting plots by population or continental group, visualizing genotype frequency distributions, comparing GSTP1 expression by sex, and replicating Figure 1B using principal component data.

At the end of the exercise, remember to shut down your compute environment! Refer to the steps here.

23.1 Learn More

23.2 Provide Feedback

Fill out this poll to share your feedback.