Chapter 9 DNA Methods Overview

9.1 Learning Objectives

This chapter will demonstrate how to:

  • Understand the goals and data collection for DNA sequence collection and variant identification.
  • Compare and contrast the following methods: DNA/SNP microarrays, Whole Genome Sequencing, Whole Exome Sequencing, and Targeted Sequencing.

9.2 What are the goals of analyzing DNA sequences?

DNA sequencing experiments serve several larger goals: assembling whole genomes, identifying variation, performing functional genomic analyses, and conducting comparative genomic studies. Each of these has implications when studying disease.

  • Assembling whole genomes:

    Because an organism’s genome determines how an organism develops and functions (NHGRI 2024), an important task in the genomics field is assembling the genome of an organism from sequencing reads. This assembly process attempts to reconstruct how the sequencing reads overlap or fit together (Schatz, Delcher, and Salzberg 2010; Li and Durbin 2024). A recent example of genome assembly is the complete 3.055 billion-base pair sequence of the human reference genome, the T2T-CHM13 version, published by the Telomere-to-Telomere (T2T) Consortium (Nurk et al. 2022) and followed not long after by the complete sequence of the human Y chromosome (Rhie et al. 2023). A goal of the field is to better capture human genetic diversity by creating a reference pangenome, assembled from multiple donors within the population (D. J. Taylor et al. 2024). Genome assemblies are an important part of genomics beyond human genomics research; there are reference genomes available for most model organisms as well as many plants, animals, and pathogens, with more being published frequently (Miller, Zimin, and Gordus 2023; Alonge et al. 2022; Gershman et al. 2023; Sistrom et al. 2016). Each reference genome acts as an extensive compilation of the observed DNA sequence of genes, regulatory elements, etc., together with the coordinate systems for these elements. For the corresponding organism, sequencing reads from other experiments can then be mapped or aligned to the reference in order to localize where each read originated in the genome. In the case of cancer informatics, a recent approach utilized personalized genome assembly to more accurately detect tumor somatic mutations. This is likely to be an area of future research for application in precision medicine (Xiao et al. 2022; Ermini and Driguez 2024).

  • Identifying variation:

    Variant caller software is used within the field of genomics to identify places where reads from a DNA sequencing experiment differ from a comparative reference genome sequence (NHGRI 2022). Variants may be as small as single nucleotide differences (single-nucleotide polymorphisms or SNPs) or much larger (50 base pairs or more) structural variants (SVs) such as duplications, deletions, insertions, inversions, and translocations (Wong, Hudson, and McPherson 2011). (Shorter insertions or deletions are termed indels.) SVs involving gains or losses in genomic DNA can lead to copy number variations (CNVs). Mutations and structural variants, as well as larger-scale catastrophic genomic rearrangements, are very common in cancer (C.-Z. Zhang and Pellman 2022). Overall, variants may be rare in a population or fairly common (Audano et al. 2019). Further, variants may be somatic or germline. Germline variants are hereditary and will be passed down from parent to offspring; in the offspring, the variant will be present in every cell. Somatic variants are generally not hereditary and are present only in some cells rather than every cell (Frost 2022). Because variation, specifically genetic diversity, is a necessary part of a healthy species (“What Is Genetic Diversity and Why Does It Matter?” n.d.) and because variation, specifically mutations/variants, may cause disease, identifying variation is a common goal in a DNA sequencing workflow. An example of research focusing on genetic diversity in humans is the 1000 Genomes Project, which recently expanded its resource of sequenced genomes and in doing so discovered even more variation present in the population (Byrska-Bishop et al. 2022).

  • Functional genomic analysis:

    Genomes contain more than just genes (the coding sequences that will be transcribed and translated into a protein); they also contain functional elements such as promoters, enhancers, or silencers that modulate the expression of genes (Kellis et al. 2014). Further, differential gene expression is the phenomenon by which cells with the same DNA sequence show different patterns of gene expression. Functional genomic analyses aim to better understand differential gene expression and the impact of genetic variation found in functional elements. For example, many human genetic variants associated with common traits and diseases are localized in or near known functional elements (Hindorff et al. 2009). These variants may impact gene expression through changes in transcription factor binding at that site or through resulting epigenetic changes, which are defined as chemical modifications of chromatin or nucleotides beyond the DNA sequence. Such epigenetic modifications, which include histone marks and DNA methylation, can alter DNA compaction and influence a functional element’s accessibility for transcriptional machinery (e.g., if an element that was previously accessible becomes inaccessible, a gene that could be transcribed before may no longer be transcribed). In later sections, methods that study epigenetic modifications like chromatin accessibility, DNA methylation, or binding of specific proteins will be discussed. All of these methods support functional genomic analyses and are important for better understanding differential gene expression and the impact that genetic variants located in functional elements may have on disease occurrence. A fairly recent and high-profile example of a functional genomic analysis again comes from the T2T Consortium. Not only did they publish a new, complete reference genome, but they also studied the epigenetic landscape in the newly resolved regions of the genome and pointed to potential newly discovered functional elements in a region previously thought to be transcriptionally inactive (Gershman et al. 2022).

  • Comparative genomics:

    A common saying in the genomics field is that structure determines function and conserved structure may be constrained such that there is an important function which needs to be conserved (Alföldi and Lindblad-Toh 2013). Further, similarities in structure may be due to shared ancestry through the processes of evolution; therefore, some comparative genomics studies aim to infer homology or an evolutionary relationship from structural similarity (Pearson 2013). More pertinent to the topics discussed previously, comparative genomics studies are also useful for identifying functional elements (J. Taylor et al. 2006) and variants associated with disease (e.g., by comparing the genomes of those with the disease and those without it and identifying differences) (Alföldi and Lindblad-Toh 2013; Eichler 2019).
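The size-based vocabulary from the "Identifying variation" item above (SNPs, indels, and structural variants of 50 base pairs or more) can be illustrated with a small Python sketch. This is only a toy classifier under stated assumptions: the function name `classify_variant` is hypothetical, alleles are given as plain reference/alternate strings, and real variant callers rely on read evidence and error models rather than a simple length comparison.

```python
SV_MIN_SIZE = 50  # threshold stated in the text: SVs are 50 base pairs or more


def classify_variant(ref: str, alt: str) -> str:
    """Label a variant from its reference and alternate alleles.

    Hypothetical helper for illustration only; it looks at allele
    lengths and ignores inversions, translocations, and read evidence.
    """
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # single nucleotide difference
    size_change = abs(len(ref) - len(alt))
    if size_change == 0:
        return "other"  # same-length multi-base change, not covered in the text
    if size_change >= SV_MIN_SIZE:
        return "SV"  # large insertion or deletion
    return "indel"  # shorter insertion or deletion


if __name__ == "__main__":
    examples = [("A", "G"), ("A", "ATG"), ("ACGT", "A"), ("A", "A" + "T" * 60)]
    for ref, alt in examples:
        label = classify_variant(ref, alt)
        print(f"ref length {len(ref)}, alt length {len(alt)}: {label}")
```

The sketch only covers variants that change sequence length or a single base; copy number and rearrangement events need additional evidence that a pair of alleles cannot express.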

9.3 Comparison of DNA methods

Figure: Comparing DNA Sequencing Techniques. The most common DNA sequencing techniques are described.

  • Whole genome sequencing covers all genes and non-coding DNA. 3.2 billion bases are covered when applied to human samples. This is the most expensive of the techniques. The depth of coverage required for 99.9% sensitivity is 30X.
  • Whole exome sequencing covers the exome, or expressed genes. Approximately 45 million bases are sequenced. This is a cost-effective technique. The depth of coverage required for 99.9% sensitivity is 100X.
  • Targeted gene panel sequencing covers 50-500 genes. 20,000 to 62 million bases are sequenced. This is the most cost-effective technique. The depth of coverage is >500X.

There are four DNA sequencing methods discussed in this chapter. The figure above compares WGS, WXS, and targeted gene sequencing. The last section compares all four.

  1. Whole genome sequencing (WGS)
  2. Whole exome sequencing (WXS)
  3. Targeted gene sequencing
  4. DNA/SNP microarrays

Compared to WXS and targeted gene sequencing, WGS is the most expensive but requires the lowest depth of coverage to achieve high sensitivity (30X for the 99.9% sensitivity cited above). In other words, WGS requires sequencing each region of the genome (3.2 billion bases) about 30 times in order to confidently pick up all possible meaningful variants. Sims et al. (2014) go into more depth on how these depths are calculated.

Alternatively, WXS is a more cost-effective way to study the genome, focusing on places in the genome that have open reading frames, that is, generally genes that are able to be expressed. Because this approach enriches for exons and not introns, splicing variants may be missed. In this case, each gene must be sequenced 80-100x for sufficient sensitivity to pick up meaningful variants.

In targeted gene sequencing, a panel of 50-500 regions of interest is selected. This technique is well suited to studying a set of specific genes of interest at great depth to identify all varieties of mutations within those genes. These genes must be sequenced at much greater depth (>500x) to confidently identify all meaningful variants. Illumina also provides information regarding sequencing depth considerations for different modalities.
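To make the trade-off between target size and depth concrete, the following Python sketch multiplies the target sizes and minimum depths quoted in this section. It is a back-of-the-envelope illustration only: the dictionary and function names are hypothetical, the targeted panel depth is taken as the stated minimum of 500x, and real experiments also produce off-target and duplicate reads that this arithmetic ignores.

```python
# Target sizes (bases) and depths as quoted in this chapter for human samples.
# The targeted panel entries use the smallest and largest sizes mentioned.
TARGETS = {
    "WGS": (3_200_000_000, 30),
    "WXS": (45_000_000, 100),
    "Targeted panel, smallest": (20_000, 500),
    "Targeted panel, largest": (62_000_000, 500),
}


def total_bases_sequenced(target_bases: int, depth: int) -> int:
    """Approximate number of bases to sequence: target size times depth."""
    return target_bases * depth


if __name__ == "__main__":
    for method, (target_bases, depth) in TARGETS.items():
        total = total_bases_sequenced(target_bases, depth)
        print(f"{method}: {target_bases:,} bases at {depth}x "
              f"is about {total / 1e9:.2f} billion bases")
```

Even though WGS needs the lowest depth, its much larger target means it still requires by far the most sequenced bases per sample under these figures.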

Additional references: WGS (Bentley et al. 2008); WES (Clark et al. 2011); targeted sequencing (Bewicke-Copley et al. 2019).

9.4 How to choose a DNA sequencing method

Before starting any sequencing method, you likely have a research question or hypothesis in mind. In order to choose a DNA sequencing method, you will need to weigh a few considerations against each other:

9.4.1 1. What region(s) of the genome pertain to your research question?

Is this unknown? Can it be narrowed down to non-coding or coding regions? Is there an even more specific subset of interest?

9.4.2 2. What does your project budget allow for?

Some methods are much more costly than others. Cost is not only a factor for the reagents needed to sequence, but also the computing power needed to process and store the data and people’s compensation for their work on the data. All of these costs increase as the amounts of data that are collected increase. For more information on computing decisions see our Computing in Cancer Informatics course.

9.4.3 3. What is your detection power for these variants?

Detecting DNA variants is not simply a matter of yes or no, but a confidence level due to sequencing errors in data collection. Are the variants you are looking for very rare and/or small (single nucleotide or very few copy number differences)? If so you will need more samples and potentially more sequencing depth to detect these variants with confidence.
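One way to see why rare variants call for more depth is a simple binomial model, sketched below in Python. The assumptions are ours, not the chapter's: each read covering a site independently carries the variant with probability equal to the variant allele fraction, sequencing errors are ignored, and a variant counts as "detected" once a hypothetical minimum number of supporting reads (`min_alt_reads`) is seen. Real variant callers use more elaborate error models.

```python
from math import comb


def prob_detect(depth: int, allele_fraction: float, min_alt_reads: int) -> float:
    """Probability of observing at least `min_alt_reads` variant reads.

    Assumes reads are independent draws (binomial sampling) and ignores
    sequencing error; an illustrative simplification only.
    """
    p_fewer = sum(
        comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical scenario: a variant present in 5% of the DNA molecules,
    # requiring at least 5 supporting reads before calling it.
    for depth in (30, 100, 500):
        p = prob_detect(depth, allele_fraction=0.05, min_alt_reads=5)
        print(f"depth {depth}x: detection probability {p:.3f}")
```

Under this model, a low-fraction variant is unlikely to reach the read threshold at 30x but is very likely to at 500x, which matches the intuition that rare or subtle variants need more depth.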

9.5 Strengths and Weaknesses of different methods

Is little known about DNA variants in the organism or disease in question? In this instance you may want to cast a wide net and explore more variants by using WGS.

If previous research has identified sections of the genome that are of interest to your research question, then it’s highly advisable not to sequence the entire genome with WGS methods. Not only will whole genome sequencing be more costly, but it will decrease your statistical power to discover true positive variants of interest and increase your chances of discovering false positive variants. This is because multiple testing correction needs to be applied when many tests are performed concurrently. In this instance, the tests are performed across the whole genome.
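The effect of multiple testing can be sketched with the Bonferroni correction, one common (and conservative) way to adjust a significance threshold. The choice of Bonferroni and the numbers of tested sites below are illustrative assumptions, not figures from this chapter; other corrections, such as false discovery rate control, are also widely used.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    # Hypothetical numbers of sites tested in each design.
    designs = {"genome-wide scan": 1_000_000, "targeted panel": 500}
    for name, n_tests in designs.items():
        threshold = bonferroni_threshold(alpha, n_tests)
        print(f"{name}: {n_tests:,} tests, per-test threshold {threshold:.1e}")
```

The more sites are tested, the stricter the per-test threshold becomes, so a true association needs stronger evidence to stand out in a genome-wide design than in a focused one.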

If your research question does not pertain to non-coding regions of the genome or splicing, then it’s advisable to use WXS. Recall that only about 1-2% of the genome is coding sequence, meaning that if you are uninterested in non-coding regions but still use WGS, then 98-99% of your data will be uninteresting to you and will only serve to increase your chances of finding false positives or cost you a lot of funding. Sequencing more of the genome not only takes more money and time but also requires more time and resources in terms of the computing power needed to analyze it.

Furthermore, if you are able to narrow down even further which regions are of interest, this would be better in terms of cost and detection abilities. A targeted sequencing panel or DNA microarray is ideal for assaying known groups of targets. DNA microarrays are the least costly of all the methods to identify DNA variants, but with both targeted sequencing and DNA microarrays you will need to find or create a custom probe or primer set. Ideally, a probe or primer set that hits your regions of interest already exists commercially; if not, you will have to design your own, which also costs time and money.
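The guidance in this section can be summarized as a small decision sketch in Python. This is only one reading of the text above, not a definitive rule: the function name `suggest_method` and its two yes/no inputs are hypothetical simplifications, and budget and detection power, which the previous section asks you to weigh as well, are deliberately left out.

```python
def suggest_method(targets_known: bool, coding_only: bool) -> str:
    """Suggest a starting point based on how narrowly the regions are defined.

    targets_known: previous research has identified specific regions of interest.
    coding_only: the question does not involve non-coding regions or splicing.
    """
    if targets_known:
        # Known groups of targets: most focused and least costly options.
        return "Targeted sequencing panel or DNA microarray"
    if coding_only:
        # Only coding sequence matters: avoid sequencing the non-coding majority.
        return "WXS"
    # Little is known, or non-coding regions matter: cast a wide net.
    return "WGS"


if __name__ == "__main__":
    for targets_known in (False, True):
        for coding_only in (False, True):
            choice = suggest_method(targets_known, coding_only)
            print(f"targets_known={targets_known}, coding_only={coding_only}: {choice}")
```

In practice the answer depends on more than two questions, so a sketch like this is best treated as a prompt for discussion rather than a selection tool.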

We will discuss three general categories of methods for evaluating DNA sequences. Whole Genome Sequencing (WGS) assays more of the genome than other methods but is much more costly and computationally intensive. Depending on your goals, WGS may be overkill. SNP microarrays, on the other hand, are much more cost-effective but cannot be used for exploratory purposes. Whole Exome Sequencing (WXS or WES) and other targeted sequencing methods allow you to survey regions of the genome in a way that is more cost-effective and potentially at higher depths.

In these upcoming chapters we will discuss in more detail each of these methods, what the data represent, what you need to consider, and what resources you can consult for analyzing your data.

References

Alföldi, Jessica, and Kerstin Lindblad-Toh. 2013. “Comparative Genomics as a Tool to Understand Evolution and Disease.” Genome Research 23 (7): 1063–68. https://doi.org/10.1101/gr.157503.113.
Alonge, Michael, Ludivine Lebeigle, Melanie Kirsche, Katie Jenike, Shujun Ou, Sergey Aganezov, Xingang Wang, Zachary B. Lippman, Michael C. Schatz, and Sebastian Soyk. 2022. “Automated Assembly Scaffolding Using RagTag Elevates a New Tomato System for High-Throughput Genome Editing.” Genome Biology 23 (1): 258. https://doi.org/10.1186/s13059-022-02823-7.
Audano, Peter A., Arvis Sulovari, Tina A. Graves-Lindsay, Stuart Cantsilieris, Melanie Sorensen, AnneMarie E. Welch, Max L. Dougherty, et al. 2019. “Characterizing the Major Structural Variant Alleles of the Human Genome.” Cell 176 (3): 663–675.e19. https://doi.org/10.1016/j.cell.2018.12.019.
Bentley, David R., Shankar Balasubramanian, Harold P. Swerdlow, Geoffrey P. Smith, John Milton, Clive G. Brown, Kevin P. Hall, et al. 2008. “Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry.” Nature 456 (7218): 53–59. https://doi.org/10.1038/nature07517.
Bewicke-Copley, Findlay, Emil Arjun Kumar, Giuseppe Palladino, Koorosh Korfi, and Jun Wang. 2019. “Applications and Analysis of Targeted Genomic Sequencing in Cancer Studies.” Computational and Structural Biotechnology Journal 17: 1348–59. https://doi.org/10.1016/j.csbj.2019.10.004.
Byrska-Bishop, Marta, Uday S. Evani, Xuefang Zhao, Anna O. Basile, Haley J. Abel, Allison A. Regier, André Corvelo, et al. 2022. “High-Coverage Whole-Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios.” Cell 185 (18): 3426–3440.e19. https://doi.org/10.1016/j.cell.2022.08.004.
Clark, Michael J., Rui Chen, Hugo Y. K. Lam, Konrad J. Karczewski, Rong Chen, Ghia Euskirchen, Atul J. Butte, and Michael Snyder. 2011. “Performance Comparison of Exome DNA Sequencing Technologies.” Nature Biotechnology 29 (10): 908–14. https://doi.org/10.1038/nbt.1975.
Eichler, Evan E. 2019. “Genetic Variation, Comparative Genomics, and the Diagnosis of Disease.” The New England Journal of Medicine 381 (1): 64–74. https://doi.org/10.1056/NEJMra1809315.
Ermini, Luca, and Patrick Driguez. 2024. “The Application of Long-Read Sequencing to Cancer.” Cancers 16 (7): 1275. https://doi.org/10.3390/cancers16071275.
Frost, Dr Amy. 2022. “Constitutional (Germline) Vs Somatic (Tumour) Variants.” NHS. https://www.genomicseducation.hee.nhs.uk/genotes/knowledge-hub/constitutional-germline-vs-somatic-tumour-variants/.
Gershman, Ariel, Quinn Hauck, Morag Dick, Jerrica M. Jamison, Michael Tassia, Xabier Agirrezabala, Saad Muhammad, et al. 2023. “Genomic Insights into Metabolic Flux in Hummingbirds.” Genome Research 33 (5): 703–14. https://doi.org/10.1101/gr.276779.122.
Gershman, Ariel, Michael E. G. Sauria, Xavi Guitart, Mitchell R. Vollger, Paul W. Hook, Savannah J. Hoyt, Miten Jain, et al. 2022. “Epigenetic Patterns in a Complete Human Genome.” Science 376 (6588): eabj5089. https://doi.org/10.1126/science.abj5089.
Hindorff, Lucia A., Praveen Sethupathy, Heather A. Junkins, Erin M. Ramos, Jayashri P. Mehta, Francis S. Collins, and Teri A. Manolio. 2009. “Potential Etiologic and Functional Implications of Genome-Wide Association Loci for Human Diseases and Traits.” Proceedings of the National Academy of Sciences 106 (23): 9362–67. https://doi.org/10.1073/pnas.0903103106.
Kellis, Manolis, Barbara Wold, Michael P. Snyder, Bradley E. Bernstein, Anshul Kundaje, Georgi K. Marinov, Lucas D. Ward, et al. 2014. “Defining Functional DNA Elements in the Human Genome.” Proceedings of the National Academy of Sciences 111 (17): 6131–38. https://doi.org/10.1073/pnas.1318948111.
Li, Heng, and Richard Durbin. 2024. “Genome Assembly in the Telomere-to-Telomere Era.” Nature Reviews Genetics, April, 1–13. https://doi.org/10.1038/s41576-024-00718-w.
Miller, Jeremiah, Aleksey V. Zimin, and Andrew Gordus. 2023. “Chromosome-Level Genome and the Identification of Sex Chromosomes in Uloborus Diversus.” GigaScience 12 (January): giad002. https://doi.org/10.1093/gigascience/giad002.
NHGRI. 2022. “Genomic Data Science.” https://www.genome.gov/about-genomics/fact-sheets/Genomic-Data-Science.
———. 2024. “Genome.” https://www.genome.gov/genetics-glossary/Genome.
Nurk, Sergey, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, Mitchell R. Vollger, et al. 2022. “The Complete Sequence of a Human Genome.” Science 376 (6588): 44–53. https://doi.org/10.1126/science.abj6987.
Pearson, William R. 2013. “An Introduction to Sequence Similarity (‘Homology’) Searching.” Current Protocols in Bioinformatics 42 (1). https://doi.org/10.1002/0471250953.bi0301s42.
Rhie, Arang, Sergey Nurk, Monika Cechova, Savannah J. Hoyt, Dylan J. Taylor, Nicolas Altemose, Paul W. Hook, et al. 2023. “The Complete Sequence of a Human Y Chromosome.” Nature 621 (7978): 344–54. https://doi.org/10.1038/s41586-023-06457-y.
Schatz, Michael C., Arthur L. Delcher, and Steven L. Salzberg. 2010. “Assembly of Large Genomes Using Second-Generation Sequencing.” Genome Research 20 (9): 1165–73. https://doi.org/10.1101/gr.101360.109.
Sims, David, Ian Sudbery, Nicholas E. Ilott, Andreas Heger, and Chris P. Ponting. 2014. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15 (2): 121–32. https://doi.org/10.1038/nrg3642.
Sistrom, Mark, Benjamin Evans, Joshua Benoit, Oliver Balmer, Serap Aksoy, and Adalgisa Caccone. 2016. “De Novo Genome Assembly Shows Genome Wide Similarity Between Trypanosoma Brucei Brucei and Trypanosoma Brucei Rhodesiense.” PLOS ONE 11 (2): e0147660. https://doi.org/10.1371/journal.pone.0147660.
Taylor, Dylan J., Jordan M. Eizenga, Qiuhui Li, Arun Das, Katharine M. Jenike, Eimear E. Kenny, Karen H. Miga, et al. 2024. “Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References.” Annual Review of Genomics and Human Genetics, April. https://doi.org/10.1146/annurev-genom-021623-081639.
Taylor, James, Svitlana Tyekucheva, David C. King, Ross C. Hardison, Webb Miller, and Francesca Chiaromonte. 2006. “ESPERR: Learning Strong and Weak Signals in Genomic Sequence Alignments to Identify Functional Elements.” Genome Research 16 (12): 1596–1604. https://doi.org/10.1101/gr.4537706.
“What Is Genetic Diversity and Why Does It Matter?” n.d. Frontiers for Young Minds. https://kids.frontiersin.org/articles/10.3389/frym.2021.656168.
Wong, Kit Man, Thomas J. Hudson, and John D. McPherson. 2011. “Unraveling the Genetics of Cancer: Genome Sequencing and Beyond.” Annual Review of Genomics and Human Genetics 12 (1): 407–30. https://doi.org/10.1146/annurev-genom-082509-141532.
Xiao, Chunlin, Zhong Chen, Wanqiu Chen, Cory Padilla, Michael Colgan, Wenjun Wu, Li-Tai Fang, et al. 2022. “Personalized Genome Assembly for Accurate Cancer Somatic Mutation Discovery Using Tumor-Normal Paired Reference Samples.” Genome Biology 23 (1): 237. https://doi.org/10.1186/s13059-022-02803-x.
Zhang, Cheng-Zhong, and David Pellman. 2022. “Cancer Genomic Rearrangements and Copy Number Alterations from Errors in Cell Division.” Annual Review of Cancer Biology 6 (1): 245–68. https://doi.org/10.1146/annurev-cancerbio-070620-094029.