Chapter 20 DNA Methylation Sequencing

This chapter is incomplete! If you wish to contribute, please go to this form or our GitHub page.

20.1 Learning Objectives

This chapter will demonstrate how to:
- Understand the basics of the bisulfite sequencing data collection and processing workflow.
- Identify the next steps for your particular bisulfite sequencing data.
- Formulate questions to ask about your bisulfite sequencing data.

20.2 What are the goals of analyzing DNA methylation?

To detect methylated cytosines (5mC), DNA samples are prepped using bisulfite (BS) conversion. This converts unmethylated cytosines into uracils and leaves methylated cytosines untouched. Probes are then designed to bind to either the uracil or the cytosine, representing the unmethylated and methylated cytosines respectively.
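The effect of bisulfite conversion on a sequence can be sketched in a few lines of Python. This is only an illustration: the function name and the example sequence are made up, and it assumes the converted uracils are read as thymines after amplification and sequencing.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite conversion of one DNA strand (illustrative only).

    Unmethylated C is reported as T (uracil is read as thymine after
    amplification); C at the given 0-based methylated positions stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# The C at index 1 is methylated, so only the other two Cs are converted.
print(bisulfite_convert("ACGTCCGA", methylated_positions=[1]))  # ACGTTTGA
```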

For a given sample, you will obtain a fraction, known as the Beta value, that indicates the relative abundance of the methylated and unmethylated versions of the sequence. Beta values exist then on a scale of 0 to 1 where 0 indicates none of this particular base is methylated in the sample and 1 indicates all are methylated.
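As a minimal sketch, a Beta value for one cytosine can be computed from the methylated and unmethylated signals (for sequencing data, read counts). The function name and the choice to return `None` for sites with no signal are assumptions made for this example.

```python
def beta_value(methylated, unmethylated):
    """Return the fraction of methylated signal at one cytosine.

    Returns None when there is no signal at all, because the ratio
    is undefined in that case.
    """
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


print(beta_value(30, 10))  # 0.75
print(beta_value(0, 12))   # 0.0
```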

Note that bisulfite conversion alone does not distinguish between 5mC and 5hmC, even though the two marks often indicate different biological mechanisms.

5-hydroxymethylated cytosines (5hmC) can be detected with oxidative bisulfite sequencing (oxBS) (Booth et al. 2013). Standard bisulfite conversion measures 5mC and 5hmC together, whereas oxidative bisulfite conversion measures 5mC alone. To identify 5hmC bases, you either have to pair oxBS data with BS data, so that 5hmC can be estimated from the difference between the two, or use Tet-assisted bisulfite (TAB) sequencing, which exclusively tags 5hmC bases (Yu et al. 2012).

20.3 Methylation data considerations

20.3.1 Beta values binomially distributed

Because Beta values are ratios, they are not normally distributed and should be modeled accordingly. This means that data models built for RNA-seq data (like those used by the limma package) should not be applied directly to methylation data. More accurately, the methylated read counts underlying Beta values are commonly modeled with a binomial distribution.

Analyzing the data this way generally involves applying a generalized linear model.
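One practical consequence of the binomial view is that the same Beta value carries very different uncertainty depending on coverage. The sketch below uses the usual binomial standard error of a proportion; the function name and the read counts are hypothetical.

```python
import math


def beta_standard_error(methylated, total):
    """Binomial standard error of a methylation fraction at one site."""
    beta = methylated / total
    return math.sqrt(beta * (1 - beta) / total)


# Both sites have a Beta value of 0.5, but very different coverage.
low_coverage = beta_standard_error(2, 4)       # about 0.25
high_coverage = beta_standard_error(50, 100)   # about 0.05
print(low_coverage, high_coverage)
```

A model that only sees the Beta value of 0.5 would treat these two sites identically, whereas a binomial model that uses the counts does not.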

20.3.2 Measuring 5mC and/or 5hmC

If your questions involve both 5mC and 5hmC, each sample will have two sequencing datasets: one from the BS-processed sample and one from the oxBS-processed sample. 5mC is often an intermediate step on the way to 5hmC, so the 5mC and 5hmC measurements are, by nature, not independent of each other. In theory, the fractions of 5mC, 5hmC, and unmethylated cytosines should add up to 1.
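The add-up-to-1 relationship can be illustrated with a naive calculation for a single site. This sketch assumes the BS Beta value reflects 5mC plus 5hmC and the oxBS Beta value reflects 5mC only; the function name and the input values are hypothetical.

```python
def cytosine_fractions(bs_beta, oxbs_beta):
    """Naively split one site into 5mC, 5hmC and unmethylated fractions.

    Assumes BS measures 5mC + 5hmC and oxBS measures 5mC alone.
    The 5hmC estimate is a simple difference and can come out negative
    when the two measurements are noisy.
    """
    return {
        "5mC": oxbs_beta,
        "5hmC": bs_beta - oxbs_beta,
        "unmethylated": 1.0 - bs_beta,
    }


fractions = cytosine_fractions(bs_beta=0.8, oxbs_beta=0.5)
print(fractions)
print(sum(fractions.values()))  # approximately 1
```

Because the 5hmC estimate here is a difference of two noisy measurements, sampling noise can push it below zero, which is one motivation for modeling the two datasets together instead of subtracting them.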

Because of this, it has been proposed that the most appropriate way to analyze these data is to combine them in a single model (Kochmanski, Savonen, and Bernstein 2019).

20.4 Methylation data workflow

In a very general sense, a methylation workflow involves sequence quality control and genome alignment, like many other sequencing methods. Next, the aligned data are used to make methylation calls and to calculate methylation fractions. Lastly, you will likely want to group the methylated bases together to identify which regions of the genome are differentially methylated and of interest.

Like other sequencing methods, you will first need to start with quality control checks. Next, you will need to align your sequences to the genome. Then, using the base calls, you will need to make methylation calls: which cytosines are methylated and which are not. The details of this step depend on whether you are measuring 5mC and/or 5hmC. Lastly, you will likely want to use your methylation calls as a whole to identify differentially methylated regions of interest.
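The methylation calling step can be sketched as a comparison between the reference and an aligned bisulfite-converted read. This is a deliberately simplified illustration, not how any particular tool implements it: it assumes the read is already aligned to the same strand with no insertions or deletions, and the function name is made up.

```python
def call_methylation(reference, read):
    """Call methylation at each reference cytosine covered by one read.

    Assumes the read is aligned to the same strand as the reference,
    starts at reference position 0, and contains no indels.
    """
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls.append((pos, "methylated"))      # C survived conversion
        elif read_base == "T":
            calls.append((pos, "unmethylated"))    # C was converted
        # Any other base is a mismatch or sequencing error and is skipped.
    return calls


print(call_methylation("ACGTCCGA", "ACGTTTGA"))
# [(1, 'methylated'), (4, 'unmethylated'), (5, 'unmethylated')]
```

Counting these calls across all reads that cover a site gives the methylated and unmethylated counts from which a methylation fraction is calculated.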

20.5 Methylation Tools Pros and Cons

The following pros and cons sections were written by AI and may need verification by experts. They are meant to give you a basic idea of the pros and cons of these tools, but you should ultimately use your own judgment.

20.5.1 Quality control:

FastQC: A popular tool for evaluating the quality of sequencing reads, generating various quality control plots and statistics. It is fast, easy to use and has a simple user interface (Andrews, n.d.).
- Pros: Fast and easy to use. Very commonly used. Provides various quality control metrics and plots. Can generate reports that can be easily shared with collaborators
- Cons: Does not perform any trimming or filtering of low-quality reads. Not specifically designed for bisulfite sequencing data.
Trim Galore!: A wrapper tool for Cutadapt and FastQC that provides a simple way to trim adapters and low-quality reads. It also has built-in support for bisulfite sequencing data (Krueger and Andrews, n.d.).
- Pros: Easy to use, with a simple command line interface. Automatically trims adapters and low-quality reads. Specifically designed for bisulfite sequencing data
- Cons: Limited flexibility in terms of the trimming and filtering options. Does not provide quality control metrics or plots
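To make the idea of quality trimming concrete, here is a simplified sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by Cutadapt or Trim Galore!; the function name, the Phred+33 encoding, and the threshold of 20 are assumptions made for illustration.

```python
def trim_low_quality_tail(sequence, quality, min_phred=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_phred.

    `quality` is the FASTQ quality string, assumed to be Phred+33 encoded.
    Returns the trimmed sequence and its matching quality string.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40, '#' encodes Phred 2 and '!' encodes Phred 0.
print(trim_low_quality_tail("ACGTAC", "IIII#!"))  # ('ACGT', 'IIII')
```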

20.5.2 Alignment:

Bismark: A widely used tool for aligning bisulfite sequencing reads to a reference genome. It allows for paired-end and single-end reads, provides many options for handling sequencing errors and can output methylation calls in various formats (Liu et al. 2019).
- Pros: Performs alignment, quantification and methylation calling in a single tool. Can output methylation calls in various formats. Provides many options for handling sequencing errors and optimizing methylation calling parameters
- Cons: Can be computationally intensive for large datasets. Requires a pre-built bisulfite-converted reference genome
Bowtie2: A fast and efficient aligner that can be used for bisulfite sequencing data, and can align reads to bisulfite-converted genomes or to an unconverted genome with a pre-built bisulfite index (Langmead and Salzberg 2012).
- Pros: Very fast and efficient, making it suitable for large datasets. Can align reads to either a bisulfite-converted genome or to an unconverted genome with a pre-built bisulfite index. Provides options for handling sequencing errors and optimizing alignment parameters
- Cons: Does not perform methylation calling or quantification

20.5.3 Methylation calling:

Bismark: As well as performing alignment, Bismark can also be used to call methylation from aligned reads. It reports the percentage of cytosines methylated at each site (Liu et al. 2019).
- Pros: Performs both alignment and methylation calling in a single tool. Can output methylation calls in various formats. Provides many options for handling sequencing errors and optimizing methylation calling parameters
- Cons: Can be computationally intensive for large datasets. Requires a pre-built bisulfite-converted reference genome
MethylDackel: A fast and efficient tool for methylation calling from bisulfite sequencing data. It can output methylation calls in various formats, including a methylation bedGraph.
- Pros: Very fast and efficient, making it suitable for large datasets. Provides options for handling sequencing errors and optimizing methylation calling parameters. Can output methylation calls in various formats, including a methylation bedGraph
- Cons: Does not perform alignment or methylation quantification

20.5.4 Methylation quantification:

MethylKit: A popular tool for quantifying methylation levels from bisulfite sequencing data. It can handle various types of data and provides options for filtering out low-quality data and detecting differentially methylated regions (Akalin et al. 2012).
- Pros: Provides various options for filtering out low-quality data and detecting differentially methylated regions. Can handle various types of data, including bisulfite sequencing and reduced representation bisulfite sequencing. Provides many visualization tools for analyzing methylation data
- Cons: Can be computationally intensive for large datasets. Requires some knowledge of R programming language to use effectively
Bismark: As well as methylation calling, Bismark can also quantify methylation levels at each cytosine site. It reports the number of methylated and unmethylated reads, as well as the percentage of methylation (Liu et al. 2019).
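As a rough sketch of how per-site counts like these might be read downstream, the function below parses one line of a coverage-style file. It assumes a six-column, tab-separated layout (chromosome, start, end, methylation percentage, methylated count, unmethylated count); check the documentation of your Bismark version before relying on this layout. The function name and the example line are hypothetical.

```python
def parse_coverage_line(line):
    """Parse one tab-separated per-cytosine coverage line.

    Assumed columns: chromosome, start, end, methylation percentage,
    count methylated, count unmethylated.
    """
    chrom, start, end, percent, n_meth, n_unmeth = line.rstrip("\n").split("\t")
    n_meth, n_unmeth = int(n_meth), int(n_unmeth)
    coverage = n_meth + n_unmeth
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "methylated": n_meth,
        "unmethylated": n_unmeth,
        "coverage": coverage,
        # Recompute the fraction from the counts rather than trusting
        # the rounded percentage column.
        "beta": n_meth / coverage if coverage else None,
    }


site = parse_coverage_line("chr1\t10469\t10469\t75\t3\t1\n")
print(site["coverage"], site["beta"])  # 4 0.75
```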

20.5.5 Analysis:

DSS: A popular tool for identifying differentially methylated regions (DMRs) between groups of samples. It uses a statistical model to detect significant changes in methylation levels and reports DMRs with associated p-values (Feng and Conneely 2016).
- Pros: Uses a statistical model to identify differentially methylated regions between groups of samples. Provides various options for controlling false discovery rate and adjusting for multiple comparisons. Suitable for large datasets.
- Cons: Requires some knowledge of statistical methods and programming language to use effectively. May not be suitable for smaller datasets or datasets with low coverage.
MethylKit: As well as methylation quantification, MethylKit can also be used for downstream analysis, such as clustering samples based on methylation patterns and performing functional annotation of differentially methylated regions (Akalin et al. 2012).
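The general idea of going from individual cytosines to regions can be sketched as follows. This is not the method used by DSS or MethylKit; it is a toy grouping rule, and the function name, the `max_gap` and `min_sites` parameters, and the example positions are all hypothetical.

```python
def group_into_regions(sites, max_gap=100, min_sites=3):
    """Group nearby significant sites into candidate regions.

    `sites` is a position-sorted list of (position, is_significant)
    tuples from one chromosome. A region is a run of significant sites
    in which neighbours are at most `max_gap` bases apart and which
    contains at least `min_sites` sites.
    Returns a list of (start, end, number_of_sites) tuples.
    """
    regions = []
    current = []
    for pos, significant in sites:
        if significant and (not current or pos - current[-1] <= max_gap):
            current.append(pos)
        else:
            # The run is broken by a non-significant site or a large gap.
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = [pos] if significant else []
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


sites = [(100, True), (150, True), (180, True), (200, False),
         (900, True), (950, True),
         (1500, True), (1540, True), (1580, True)]
print(group_into_regions(sites))  # [(100, 180, 3), (1500, 1580, 3)]
```

Dedicated tools replace the per-site significance flag and the fixed thresholds with a statistical model, which is why they are preferable for real analyses.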

20.6 More resources

DNA methylation analysis with Galaxy tutorial
The mint pipeline for analyzing methylation and hydroxymethylation data.
Book chapter about finding methylation regions of interest

References

Akalin, Altuna, Matthias Kormaksson, Sheng Li, Francine E Garrett-Bakelman, Maria E Figueroa, Ari Melnick, and Christopher E Mason. 2012. “methylKit: A Comprehensive R Package for the Analysis of Genome-Wide DNA Methylation Profiles.” Genome Biology 13 (10): R87. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-10-r87.

Andrews, Simon. n.d. “FastQC: A Quality Control Tool for High Throughput Sequence Data.” https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Booth, Michael J, Tobias W B Ost, Dario Beraldi, Neil M Bell, Miguel R Branco, Wolf Reik, and Shankar Balasubramanian. 2013. “Oxidative Bisulfite Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine.” Nature Protocols 8 (10): 1841–51. https://doi.org/10.1038/nprot.2013.115.

Feng, Hui, and Karen N Conneely. 2016. “Differential Methylation Analysis for BS-Seq Data Under General Experimental Design.” Bioinformatics 32 (2): 289–91. https://pubmed.ncbi.nlm.nih.gov/26819470/.

Kochmanski, Joseph, Candace Savonen, and Alison I. Bernstein. 2019. Frontiers in Genetics 10. https://doi.org/10.3389/fgene.2019.00801.

Krueger, Felix, and Simon R Andrews. n.d. “Trim Galore!: A Wrapper Tool Around Cutadapt and FastQC to Consistently Apply Quality and Adapter Trimming to FastQ Files.” https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.

Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9 (4): 357–59. https://www.nature.com/articles/nmeth.1923.

Liu, Yi, Kimberly D Siegmund, Peter W Laird, and Benjamin P Berman. 2019. “Bismark: A Flexible Aligner and Methylation Caller for Bisulfite-Seq Applications.” Bioinformatics 36 (22-23): 5280–82. https://academic.oup.com/bioinformatics/article/27/11/1571/216956.

Yu, Miao, Gary C Hon, Keith E Szulwach, Chun-Xiao Song, Peng Jin, Bing Ren, and Chuan He. 2012. “Tet-Assisted Bisulfite Sequencing of 5-Hydroxymethylcytosine.” Nature Protocols 7 (12): 2159–70. https://doi.org/10.1038/nprot.2012.137.