Chapter 7 Microarray Data

This chapter is in a beta stage. If you wish to contribute, please go to this form or our GitHub page.

7.1 Learning Objectives

This chapter will demonstrate how to:
Understand the very general basics of the microarray data collection and processing workflow.
Understand the limitations and strengths of microarray data in general.

7.2 Summary of microarrays

Microarrays were in use before high throughput sequencing methods became affordable and widespread, but they can still be an effective and affordable tool for genomic assays. Depending on your goals, a microarray may be a suitable choice for your genomic study.

7.3 How do microarrays work?

All microarrays work by hybridization to sets of oligonucleotides on a chip. However, the preparation of the samples and the oligonucleotides’ hybridization targets vary depending on the assay and goals.

In principle, oligonucleotide probes are designed for different targets, and the sets of probes designed for the same target are placed together. Across the whole chip, these probes are arranged in a grid-like design so that, after a sample is hybridized to them, you can measure how much of each target is present by taking an image and knowing which target each location was designed for.
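The idea of reading a grid of spots against a known layout can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the target names, the 2 x 3 layout, and the intensity values are made up, and real platforms use far larger grids and more careful summarization than a plain mean.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_target(intensities, layout):
    """Average the spot intensities of all probes that share a target.

    intensities: 2D list of measured fluorescence values, one per grid spot.
    layout: 2D list of the same shape naming the target at each spot.
    """
    per_target = defaultdict(list)
    for intensity_row, layout_row in zip(intensities, layout):
        for value, target in zip(intensity_row, layout_row):
            per_target[target].append(value)
    return {target: mean(values) for target, values in per_target.items()}


# Hypothetical chip: which target each grid location was designed for.
layout = [["geneA", "geneA", "geneB"],
          ["geneB", "geneC", "geneC"]]
# Hypothetical intensities read from the scanned image at the same locations.
intensities = [[100.0, 120.0, 40.0],
               [60.0, 900.0, 1100.0]]

summary = summarize_by_target(intensities, layout)
print(summary)
```

The layout acts as the lookup that turns "brightness at row 1, column 2" into "signal for geneC", which is why an accurate and current probe annotation matters so much downstream.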

7.3.1 Pros:

Microarrays are much more affordable than high throughput sequencing, which can allow you to run more samples and have more statistical power (Tarca, Romero, and Draghici 2006; ALSF 2019).
Microarrays take less time to process than most high throughput sequencing methods (Tarca, Romero, and Draghici 2006; ALSF 2019).
Microarrays are generally less computationally intensive to process, so you can get your results more quickly (Tarca, Romero, and Draghici 2006; ALSF 2019).
Microarrays are generally as good as sequencing methods for predicting clinical endpoints (W. Zhang et al. 2015).

7.3.2 Cons:

Microarray chips can only measure the targets they are designed for, and cannot be used for exploratory purposes (W. Zhang et al. 2015).
Microarrays’ probe designs can only be as up to date as the genome they were designed against at the time (Mantione et al. 2014; ALSF 2019).
Microarrays do not escape oligonucleotide biases such as GC content and sequence composition biases (ALSF 2019).

7.4 What types of arrays are there?

7.4.1 SNP arrays

Single nucleotide polymorphism (SNP) arrays are designed to target and measure DNA variants. When the sample is hybridized, the amount of fluorescence detected can be interpreted to indicate the presence of the variant and whether the variant is homozygous or heterozygous. The samples prepped for SNP arrays therefore need to be DNA samples.
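As a rough illustration of how fluorescence can be turned into a homozygous or heterozygous call, the sketch below compares the signal from two allele-specific probes. The function name and the 0.2 / 0.8 cutoffs are assumptions chosen for the example, not values from any genotyping software; real callers typically cluster many samples per SNP rather than applying fixed thresholds.

```python
def call_genotype(intensity_a, intensity_b, low=0.2, high=0.8):
    """Call a genotype from two allele-specific probe intensities.

    The cutoffs `low` and `high` are illustrative assumptions only.
    """
    total = intensity_a + intensity_b
    if total <= 0:
        return "no call"  # no usable signal at this SNP
    b_fraction = intensity_b / total  # share of signal from allele B
    if b_fraction < low:
        return "AA"  # homozygous for allele A
    if b_fraction > high:
        return "BB"  # homozygous for allele B
    return "AB"      # heterozygous: both alleles contribute signal


# Made-up intensities for three SNPs in one sample.
for a, b in [(950, 50), (500, 480), (30, 970)]:
    print(a, b, call_genotype(a, b))
```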

7.4.1.1 Examples:

The 1000 Genomes Project is a large collection of SNP array data from many populations around the world and is available for download.

7.4.2 Gene expression arrays

Gene expression arrays are designed to target transcripts and measure their relative abundance levels.

7.4.2.1 Examples:

refine.bio is the largest collection of publicly available, already normalized gene expression data (including gene expression microarrays).
Getting started in gene expression microarray analysis (Slonim and Yanai 2009).
Microarray and its applications (Govindarajan et al. 2012).
Analysis of microarray experiments of gene expression profiling (Tarca, Romero, and Draghici 2006).

7.4.3 DNA methylation arrays

DNA methylation can also be measured by microarray. To detect methylated cytosines (5mC), DNA samples are prepped using bisulfite conversion. This converts unmethylated cytosines into uracils and leaves methylated cytosines untouched. Probes are then designed to bind to either the uracil or the cytosine, representing the unmethylated and methylated cytosines respectively.

A ratio of the fluorescence signal can be used to identify the relative abundance of the methylated and unmethylated versions of the sequence.
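One common way to express this ratio is the beta value, the methylated signal divided by the total signal plus a small stabilizing offset. The sketch below assumes that form; the function name is made up for illustration, and the default offset of 100 is a conventional choice used here as an assumption rather than something this chapter prescribes.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Estimate the fraction of methylated copies at one CpG site.

    methylated / unmethylated: probe intensities for the two versions.
    offset: stabilizes the ratio when both intensities are low
        (the default is an assumption for this sketch; it also avoids
        division by zero when both signals are zero).
    """
    meth = max(methylated, 0.0)      # negative values can appear after
    unmeth = max(unmethylated, 0.0)  # background correction; clamp them
    return meth / (meth + unmeth + offset)


# Made-up intensities: mostly methylated, mixed, and mostly unmethylated.
for m, u in [(9000, 400), (3000, 3100), (200, 8000)]:
    print(m, u, round(beta_value(m, u), 3))
```

Values near 1 suggest the site is mostly methylated in the sample, values near 0 mostly unmethylated, and intermediate values a mixture.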

Additionally, 5-hydroxymethylated cytosines (5hmC) can be detected by oxidative bisulfite sequencing (Booth et al. 2013). Note that bisulfite conversion alone will not distinguish between 5mC and 5hmC, even though these often indicate different biological mechanisms.

7.5 General processing of microarray data

After scanning, microarray data starts as an image that needs to be quantified, normalized and further corrected and edited based on the most current genome and probe annotation.
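To make the "normalized" step concrete, here is a minimal sketch of quantile normalization, one widely used approach for making intensity distributions comparable across arrays. The chapter does not prescribe this method; it is shown only as an example, the data are made up, and ties are handled naively.

```python
def quantile_normalize(samples):
    """Force every array to share the same intensity distribution.

    samples: list of equal-length lists, one list of probe intensities
        per array, with probes in the same order in every list.
    """
    n_probes = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]
    normalized = []
    for sample in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = reference[rank]
        normalized.append(result)
    return normalized


# Two hypothetical arrays measuring the same three probes.
arrays = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

Each probe keeps its rank within its own array, but the values it can take are now shared across arrays, which removes array-wide differences in brightness.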

As noted above, microarrays do not escape the base sequence biases that accompany almost all genomic assays. Ideally, the normalization methods you use will mitigate these sequence biases, and your processing will also remove probes that are outdated or that bind to multiple places on the genome.
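The probe-removal step can be sketched as a simple filter. The example assumes you already have an annotation that lists where each probe sequence aligns in a current genome build; the probe IDs, locations, and the structure of that annotation are hypothetical.

```python
def filter_probes(intensities, probe_hits):
    """Keep only probes that map to exactly one genomic location.

    intensities: dict of probe ID -> measured intensity.
    probe_hits: dict of probe ID -> list of genomic locations the probe
        aligns to (hypothetical annotation; probes missing from it are
        treated as outdated and dropped).
    """
    kept = {}
    for probe_id, value in intensities.items():
        hits = probe_hits.get(probe_id, [])
        if len(hits) == 1:  # zero hits: outdated; several: ambiguous
            kept[probe_id] = value
    return kept


# Made-up data for three probes.
intensities = {"probe_1": 812.0, "probe_2": 455.0, "probe_3": 97.0}
probe_hits = {
    "probe_1": ["chr1:11869"],
    "probe_2": ["chr2:5000", "chr7:120000"],  # binds in two places
    # probe_3 no longer aligns anywhere, so it has no entry
}

filtered = filter_probes(intensities, probe_hits)
print(filtered)
```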

The tools and methods you use to normalize and correct the microarray data will depend not only on the type of microarray assay you are performing (gene expression, SNP, methylation), but most of all on the microarray chip design/platform you are using.

7.5.1 Examples

7.5.2 Microarray Platforms

Many microarray chip designs exist, each designed to target different things. Three of the largest commercial manufacturers have ready to use microarrays you can purchase. You can also design microarrays to hit your own targets of interest.

Here are full lists of platforms that have been published on Gene Expression Omnibus.

7.6 Very General Microarray Workflow

In the data type specific chapters, we will cover the microarray workflow and file formats in more detail. But in the most general sense, microarray workflows look like this. Note that the exact file formats are specific to the chip brand and type you use (e.g. Illumina, Affymetrix, Agilent):

7.6.1 Microarray file formats

7.6.1.1 IDAT - intensity data file

This is an Illumina microarray specific file that contains the chip image intensity information for each location on the microarray. It is a binary file, which means it will not be readable by double clicking and attempting to open the file directly.

Currently, Illumina appears to suggest directly converting IDAT files into a GTC format. We advise looking into this package to help you do that.

For more on IDAT files.

7.6.1.2 DAT - data file

This is an Affymetrix-specific microarray file that parallels the IDAT file in that it contains the image intensity information for each location on the microarray. It’s stored as pixels.

For more on DAT files.

7.6.1.3 CEL

This is an Affymetrix microarray specific file that is made from a DAT file but translated into numeric values. It is not normalized yet but can be normalized into a CHP file.

For more on CEL files.

7.6.1.4 CHP

CHP files contain the gene-level and normalized data from an Affymetrix array chip. CHP files are obtained by normalizing and processing CEL files.

For more about CHP files.

7.7 General informatics files

At various points in your genomics workflows, you may need to use other types of files to help you annotate your data. We’ll also discuss some of these common files that you may encounter:

7.7.0.1 BED - Browser Extensible Data

A BED file is a text file that holds the coordinates of genomic regions. The other columns that accompany the genomic coordinates vary depending on the context, but every BED file starts with the chrom, chromStart and chromEnd columns.

A BED file might look like this:

chrom   chromStart  chromEnd  other_optional_columns
chr1    0           1000      good
chr2    100         3000      bad

For more on BED files.
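Reading the three required columns takes very little code. The sketch below parses the example rows above; the function name and the returned dictionary layout are choices made for this illustration. It relies on the BED convention that chromStart is 0-based and chromEnd is exclusive, so a region's length is simply chromEnd minus chromStart.

```python
def parse_bed_line(line):
    """Parse one BED data line into a dictionary.

    Only the three required columns are interpreted; anything after them
    is returned untouched because its meaning depends on the context.
    """
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return {
        "chrom": chrom,
        "start": start,          # 0-based, inclusive
        "end": end,              # exclusive
        "length": end - start,
        "extra": fields[3:],
    }


bed_lines = [
    "chr1\t0\t1000\tgood",
    "chr2\t100\t3000\tbad",
]
regions = [parse_bed_line(line) for line in bed_lines]
for region in regions:
    print(region["chrom"], region["length"], region["extra"])
```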

7.7.0.2 GFF/GTF General Feature Format/Gene Transfer Format

A GFF file is a tab delimited file that contains information about genomic features. These types of files are available from databases, and you can use them to annotate your data.

You may see GFF2, GFF3, and GTF files. These are different versions and variations of the same format and generally carry the same information. GFF2 is being phased out, so GFF3 is usually the better choice unless the program or package you are using specifies that it needs the older GFF2 version.

A GFF file may look like this (borrowed example from Ensembl):

1 transcribed_unprocessed_pseudogene  gene        11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";

Files like this are useful for annotating genes with what we know about them, such as gene names and biotypes.
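To show how such a line can be used for annotation, the sketch below splits the Ensembl example into its nine tab-delimited columns and turns the attribute column into a dictionary. The function name is made up for this illustration, and the attribute parsing is deliberately naive: it assumes the GTF-style `key "value";` layout seen in the example and would not cope with semicolons inside quoted values.

```python
def parse_gtf_line(line):
    """Parse one GTF-style line into its columns and attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-delimited columns")
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = columns

    parsed_attributes = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": parsed_attributes,
    }


# The example line from above, rebuilt with explicit tab separators.
example = "\t".join([
    "1", "transcribed_unprocessed_pseudogene", "gene",
    "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_source "havana"; '
    'gene_biotype "transcribed_unprocessed_pseudogene";',
])

record = parse_gtf_line(example)
print(record["attributes"]["gene_name"], record["start"], record["end"])
```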

For more about GTF and GFF files.

7.7.1 Other files

If you didn’t see the file type you are looking for listed here, take a look at this list by the Broad Institute. It may also be covered in the data type specific chapters.

7.7.2 Microarray processing tutorials:

For the most common microarray platforms, you can see these examples for how to process the data:

Booth, Michael J, Tobias W B Ost, Dario Beraldi, Neil M Bell, Miguel R Branco, Wolf Reik, and Shankar Balasubramanian. 2013. “Oxidative Bisulfite Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine.” Nature Protocols 8 (10): 1841–51. https://doi.org/10.1038/nprot.2013.115.

Govindarajan, Rajeshwar, Jeyapradha Duraiyan, Karunakaran Kaliyappan, and Murugesan Palanisamy. 2012. “Microarray and Its Applications.” Journal of Pharmacy & Bioallied Sciences 4 (Suppl 2): S310–12. https://doi.org/10.4103/0975-7406.100283.

Mantione, K. J., R. M. Kream, H. Kuzelova, R. Ptacek, J. Raboch, J. M. Samuel, and G. B. Stefano. 2014. “Comparing Bioinformatic Gene Expression Profiling Methods: Microarray and RNA-Seq.” Medical Science Monitor Basic Research 20 (August): 138–42. https://doi.org/10.12659/MSMBR.892101.

Slonim, Donna K., and Itai Yanai. 2009. “Getting Started in Gene Expression Microarray Analysis.” PLOS Computational Biology 5 (10): e1000543. https://doi.org/10.1371/journal.pcbi.1000543.

Tarca, A. L., R. Romero, and S. Draghici. 2006. “Analysis of Microarray Experiments of Gene Expression Profiling.” American Journal of Obstetrics and Gynecology 195 (2): 373–88. https://doi.org/10.1016/j.ajog.2006.07.001.

Zhang, Wenqian, Ying Yu, Falk Hertwig, Jean Thierry-Mieg, Wenwei Zhang, Danielle Thierry-Mieg, Jian Wang, et al. 2015. “Comparison of RNA-Seq and Microarray-Based Models for Clinical Endpoint Prediction.” Genome Biology 16 (1). https://doi.org/10.1186/s13059-015-0694-1.