Chapter 16 ATAC-Seq

This chapter is incomplete! If you wish to contribute, please go to this form or our GitHub page.

16.1 Learning Objectives

Learning objectives This chapter will demonstrate how to: Understand the basics of ATAC-Seq data collection and processing workflow. Identify the next steps for your particular ATAC-Seq data. Formulate questions to ask about your ATAC-Seq data

16.2 What are the goals of ATAC-Seq analysis?

The goals of ATAC-seq are to identify the accessible regions of the genome in a particular set of samples. These data allow us to understand the relationships between the chromatin accessibility patterns and cell states, and to understand the mechanistic causes and consequences of these chromatin accessibility patterns.

What does accessibility to chromatin represent? In ATAC-seq we are able to sequence open chromatin and find out DNA sequences where chromatin is accessible for activity.

ATAC-seq data is generated by fragmenting the genome with the Tn5 endonuclease and sequencing the shorter DNA fragments. While most of the genome is associated with protein complexes that preclude the digestion of DNA by Tn5, some regions of the genome have accessible chromatin that can be cleaved by Tn5 resulting in short (<500bp) fragments. These regions of the genome are of biological interest as they are likely to harbor transcription factor binding sites and to constitute cis-regulatory elements, genomic regions that are involved in the regulation of gene expression.

Schematic of how Tn5 fragments open chromatin + inserts adapters. This step is important for the quick protocol and low required cell inputs of ATAC-seq

16.2.1 What questions can be answered with ATAC-seq?

What types of questions can we ask with ATAC-seq?What regions of the genome have accessible chromatin? How does accessibility differ between biological samples or change over time? What transcription factor motifs or transcription factor footprints can be found at accessible regions of interest?

16.3 ATAC-Seq general workflow overview

A basic ATAC-seq workflow involves mapping sequence reads to the genome, identifying peaks, assessing data quality, and identifying patterns of interest through clustering or identification of differentially accessible regions or other statistical means.

16.3.1 Data quality metrics:

16.3.1.1 Pre-sequencing QC:

16.3.1.2 Sequencing considerations:

Single end sequencing. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting

Single vs. paired end sequencing. Single. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting. Read length & read depth. 75bp or more read length (keep in mind nucleosomes are 147bp). ~50 million reads/sample usually recommended

16.3.1.3 Pre-alignment QC:

Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available)

A tool like FastQC or similar should be used to check for GC content, read quality and length, and primer or adapter reads prior to alignment. Trimmomatic is a useful tool for removing primer and adapter sequences if they are present. ATAC-seq experiments should be sequenced with paired-end sequencing, and existing pipelines will expect paired-end. (2 files *_R1.fastq and *_R2.fastq)

Use fasterq-dump to download files from NCBI Sequence Read Archive - this tool will automatically split the reads in multiple files

16.3.1.4 Number of mapped reads

As for all DNA-sequencing based genomics technologies, a sufficient number of mapped reads is required to obtain meaningful results from a sample. You can read more about general sequencing technologies in our previous chapter here. For experiments on human samples this number should be greater than 20 million mapped unique reads.

Bowtie2 is commonly used for mapping fragments to the genome.

16.3.1.5 Post-alignment QC:

Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available)

Post alignment: check percent of matched, unmatched, unpaired and duplicated reads. Reads which are duplicated or unmatched should be filtered out. Picard is a useful tool for this step. Reads on the + strand should be shifted +4bp, reads on the - strand should be shifted -5 bp.

16.3.1.6 Fragment size distribution:

ATAC-seq data is often generated using paired end sequencing technologies, which allow for characterization of ATAC-seq fragments. Histograms of these distributions using single base pair resolution bins reveal patterns of enrichment relative to the nucleosome scale of 147bp and the DNA-helix scale ~10.5bp.

Considerations for quality data: QC checkpoints. Pre-sequencing. Library distribution

When comparing ATAC-seq samples, it is important to consider the fragment size distributions of the samples being compared. Differences in the distributions could lead to results that are unrelated to biology.

16.3.1.7 Peak calling:

ATAC-seq peak calling typically makes use of analysis tools developed for ChIP-seq. MACS2 is one of the most common choices for a peak calling tool, but HOMER or other common ChIP-seq peak callers are also acceptable. An input sample is not typically generated for ATAC-seq as it would be for a ChIP-seq experiment, so the major requirement for the peak caller is that it does not require the input control to call peaks.

Overview of ATAC-seq data analysis pipeline #### Number of peaks:

Although the number of accessible chromatin regions can vary from one cell type to another, there are several regions that appear to be constitutively accessible across most cell types. At least 20,000 peaks can be identified in a high quality experiment. The deeper the sequencing the more peaks will be detected in an ATAC-seq experiments. At a very high sequencing depth some of the statistically significant peaks might not be of biological interest. In an analysis of such data sets the fold enrichment relative to background, or absolute peak signal, in addition to statistical significance, ought to be taken into account.

16.3.1.8 FRiP score (fraction of reads in peaks)

In high quality ATAC-seq data a large fraction of reads overlap with peaks, while in low quality data there is a high level of fragments that map to background regions. Ideally, the FRiP score is greater than 0.3 (30 percent or more of reads overlap with peaks), with a score below 0.2 indicating low-quality data

16.3.1.9 Overlap with other chromatin accessibility data

Thousands of ATAC-seq samples have been produced in human and mouse. High quality ATAC-seq data will share a substantial proportion of peaks with many of these datasets. Publicly available ATAC-seq data can be found and comparisons made at the Cistrome Data Browser [http://cistrome.org/db/].

16.3.1.10 Overlap with promoters

The promoter regions of many genes are constitutively accessible. Examining peak overlap with regions close to known protein coding gene transcription start sites can be used as a check for data quality.

16.3.2 Information from ATAC-seq analysis:

16.3.2.1 Major approaches:

Compare changes in transcription factor motif enrichment in accessible regions between samples
Compare changes in accessibility of regions (differential accessibility) between samples
Footprinting - identify regions where insertion is below expected level

16.3.2.2 Differential accessibility analysis:

Differential accessibility analysis typically uses packages for RNA-seq differential expression analysis such as DEseq2, edgeR, or limma.

All three are available as R packages and can be installed using Bioconductor, a bioinformatics package manager for R. Unfortunately, there are no well-established packages for this analysis in other languages such as Python. Differential accessibility analysis is an approach with high potential, but care must be taken in processing and normalizing the data for accurate results.

16.3.2.3 Motif analysis:

Motif analysis in ATAC-seq is more complex than for ChIP-seq because a larger set of TFs are responsible for the emergence of chromatin accessible regions than for the binding sites of a particular TF. Nevertheless, in the analysis of differential ATAC-seq peaks motif analysis can be used to reveal the TFs related to differences between conditions. This type of analysis is most likely to be successful when the ATAC-seq between closely related conditions or cell types is being compared.

The MEME suite has a variety of tools for motif analysis available in both web and command-line versions.

16.3.2.4 Motif Scanning

Motif scanning is an analysis technique which identifies putative transcription factor binding sites (TFBS) which sufficiently match a given TF motif’s position-weight matrix. PWMscan is a straightforward online tool, but not the best option for high throughput. FIMO is an alternative which can be used either on the web or the command line. This approach will identify all sites within the genome which are likely to bind a single transcription factor.

16.3.2.5 Motif discovery:

Homer or MEME. These tools identify overrepresented sequences within the accessible peaks, regardless of whether they match a previously defined motif. Once the ATAC-seq peaks are determined, the next step is to search for enriched DNA sequence motifs within these regions. This is accomplished by using motif discovery algorithms such as MEME Suite, HOMER, or DREME. These tools scan the ATAC-seq peaks for overrepresented sequence patterns, which may correspond to binding sites for specific transcription factors or other regulatory elements. The motifs discovered can be compared against existing motif databases, such as JASPAR or TRANSFAC, to annotate the potential transcription factor binding sites.

16.3.2.6 Motif Enrichment:

These motif enrichment tools will scan through and identify matches to known motif sequences within accessible sites, and additionally will quantify whether the motif is significantly enriched compared to a control sample (input, uncommon with ATAC-seq) or a shuffled sequence to mimic background.

After identifying the enriched motifs, researchers can perform motif enrichment analysis to determine the significance of these motifs in the ATAC-seq peaks. This is often done using statistical tools like Fisher’s exact test or hypergeometric test, which assess the enrichment of specific motifs compared to their background occurrence in the genome. Additionally, tools like GREAT or HOMER can be employed to perform gene ontology analysis and assess the functional relevance of the identified motifs in biological processes and pathways.

Overall, ATAC-seq motif enrichment analysis provides researchers with valuable insights into the regulatory landscape of the genome. By identifying enriched motifs within accessible chromatin regions, researchers can gain a deeper understanding of the transcriptional regulatory networks and potentially uncover novel transcription factors involved in specific biological processes or diseases. This analysis serves as a powerful tool for unraveling the intricacies of gene regulation and can pave the way for further investigations in functional genomics and therapeutic development. Homer or MEME suite tools.

16.4 ATAC-Seq data strengths:

The ATAC-seq is easy to adopt and has been used by many laboratories to generate high quality data for characterizing accessible chromatin in cell lines or sorted cells derived from tissues.
In principle, ATAC-seq can identify a large proportion of cis-regulatory elements.
In contrast to ChIP-seq, ATAC-seq does not require specific antibodies- ATAC-seq is a time-efficient protocol which requires low cell input.
In comparison with histone modification ChIP-seq, ATAC-seq provides a higher resolution assessment of the cis-regulatory genomic regions. Histone modification ChIP-seq, in contrast, tends to be localized on nucleosomes flanking the site of interest and can spread to nucleosomes beyond the immediate flanking ones.

16.5 ATAC-Seq data limitations:

ATAC-seq does not precisely identify the transcription factors or other chromatin associated factors that bind in or around chromatin accessible regions. This type of information needs to be inferred through analysis of transcription factor binding motif analysis or ChIP-seq data.
Whereas ATAC-seq indicates the presence of a putative cis-regulatory element, H3K27ac ChIP-seq is able to separate accessible regions from those that are accessible and active.
Accessible regions are not necessarily cis-regulatory regions, although many of them are.
The genes that are regulated by cis-regulatory elements cannot be identified conclusively by ATAC-seq alone.
ATAC-seq data can be biased, and affected by batch effects like any other genomics data type. When comparing ATAC-seq data good experimental design principles like the inclusion of biological replicates and consideration of controls, are needed for a meaningful outcome. .

16.6 ATAC-Seq data considerations

The nucleosome is the fundamental unit of chromatin packaging in the genome and nucleosomal DNA is far less likely to be cleaved by the Tn5 nuclease than linker DNA. When DNA is fragmented by Tn5 the positions of the endpoints relative to the nucleosomes is an important consideration. When the ends are less than 147bp apart it is likely that both ends originate from the same linker region. Longer fragments can result from cuts on opposite sides of the same nucleosome, or even opposite sides of a genomic interval that encompasses multiple nucleosomes. The short fragments are therefore most likely to be nucleosome free and provide stronger evidence for transcription factor binding sites.

As will other genomics protocols, ATAC-seq data is subject to biases introduced in the ATAC-seq protocol and in the sequencing itself. Comparison of ATAC-seq data generated in different batches, by different laboratories or using different protocols might not be directly comparable. In addition, the Tn5 endonuclease does have biases in the precise DNA sequences it can cut. This should be taken into consideration when carrying out base pair resolution analyses including footprinting analysis and analysis of the effects of sequence variants on chromatin accessibility.

Read depth will impact ATAC-seq signal, but enzyme strength and conditions can also alter the distribution of cuts.

When using ATAC-seq data to answer biological questions it is important to understand what types of bias could impact the results. To ensure valid results the analysis needs to use appropriate statistical methods, ensure enough high quality ATAC-seq data is available, including controls, and possibly reframing the questions.

16.7 ATAC-seq analysis tools

This section has been written by AI and needs verification by experts. This is meant to give you a basic idea of the pros and cons of these tools but should ultimately be used with your own judgment.

MACS2(Y. Zhang et al. 2008):
- Pros: widely used, handles both paired-end and single-end sequencing data, allows for differential peak calling between different samples.
- Cons: assumes that all peaks have the same shape, may not be as accurate as other peak-calling tools in some cases.
HOMER(Heinz et al. 2010):
- Pros: includes tools for peak-calling, motif analysis, and annotation of nearby genes, user-friendly interface, handles both paired-end and single-end sequencing data.
- Cons: may not be as accurate as other peak-calling tools in some cases.
ATACseqQC(Schep et al. 2017):
- Pros: provides several metrics and plots for evaluating data quality, identifies potential issues with data such as batch effects, sequencing depth, and library complexity.
- Cons: does not perform peak-calling or downstream analysis.
deeptools(Ramı́rez et al. 2016):
- Pros: includes tools for normalization, visualization, and comparison of ATAC-seq data, generates heatmaps, profiles, and other plots for visualizing chromatin accessibility.
- Cons: may require some programming skills to use effectively.
DFilter (Ghavi-Helm et al. 2019):
- Pros: uses a deep learning approach to predict the likelihood of a genomic region being an ATAC-seq peak, can handle both paired-end and single-end sequencing data, has been shown to outperform other peak-calling tools in some cases.
- Cons: may require more computational resources than other tools.

16.8 Additional tutorials and tools

This section has been written by AI and needs verification by experts. This is meant to give you a basic idea of the pros and cons of these tools but should ultimately be used with your own judgment.

MACS2(Y. Zhang et al. 2008):
- Pros: widely used, handles both paired-end and single-end sequencing data, allows for differential peak calling between different samples.
- Cons: assumes that all peaks have the same shape, may not be as accurate as other peak-calling tools in some cases.
HOMER(Heinz et al. 2010):
- Pros: includes tools for peak-calling, motif analysis, and annotation of nearby genes, user-friendly interface, handles both paired-end and single-end sequencing data.
- Cons: may not be as accurate as other peak-calling tools in some cases.
ATACseqQC(Schep et al. 2017):
- Pros: provides several metrics and plots for evaluating data quality, identifies potential issues with data such as batch effects, sequencing depth, and library complexity.
- Cons: does not perform peak-calling or downstream analysis.
deeptools(Ramı́rez et al. 2016):
- Pros: includes tools for normalization, visualization, and comparison of ATAC-seq data, generates heatmaps, profiles, and other plots for visualizing chromatin accessibility.
- Cons: may require some programming skills to use effectively.
DFilter (Ghavi-Helm et al. 2019):
- Pros: uses a deep learning approach to predict the likelihood of a genomic region being an ATAC-seq peak, can handle both paired-end and single-end sequencing data, has been shown to outperform other peak-calling tools in some cases.
- Cons: may require more computational resources than other tools.

16.9 Additional tutorials and tools

A Galaxy based tutorial for ATAC-seq - Galaxy is a good recommendation for those new to informatics who would like a cloud-based GUI option to use for the analysis of their data.
MACS - Model-based analysis for ChIP-Seq - A command line tool for the identification of transcription factor binding sites. Can be used with ChIP-seq or ATAC-seq.
CHIPS - A Snakemake pipeline for quality control and reproducible processing of chromatin profiling data. This tool will require some snakemake and coding knowledge. For more recommendations about coding see our later chapter about general data analysis tools.
Cistrome DB - a visual tool to allow you to browse your ATAC-seq data.
SELMA - Simplex Encoded Linear Model for Accessible Chromatin - SELMA is a python based tool for the assessment of biases in Chromatin based data.

16.10 Online Visualization tools

Cistrome DB - a visual tool to allow you to browse your ATAC-seq data.
UCSC Xena is a web-based visualization tool for multi-omic data and associated clinical and phenotypic annotations. It can be used with ATAC-seq data.
Integrative Genomics Viewer (IGV) is a track-based browser for interactively exploring genomic data mapped to a reference genome.

16.11 More resources about ATAC-seq data

ATAC-seq overview from Galaxy - these slides explain the overarching concepts of ATAC-seq.
ATAC seq guidelines from Harvard - this workflow runs through step by step how to analysis ATAC-seq data and what different parameters mean.
ATAC-seq review - this paper gives a great overview of ATAC-seq data and step by step what needs to be considered.
Identifying and mitigating bias in chromatin
CHIP Snakemake pipeline for analyzing ChIP-seq and chromatin accessibility data
Paper on bias in DNase-seq footprinting analysis and fragment size effects, similar comments apply to ATAC-seq
SELMA Method for evaluating footprint bias in ATAC-seq

References

Ghavi-Helm, Yad, Aleksander Jankowski, Sascha Meiers, Rebecca R Viales, Jan O Korbel, and Eileen EM Furlong. 2019. “Highly Rearranged Chromosomes Reveal Uncoupling Between Genome Topology and Gene Expression.” Nature Genetics 51 (8): 1272–82.

Heinz, S, Christopher Benner, Nathanael Spann, Eric Bertolino, Yen Ching Lin, Petra Laslo, Jia Cheng, et al. 2010. “Simple Combinations of Lineage-Determining Transcription Factors Prime Cis-Regulatory Elements Required for Macrophage and b Cell Identities.” Molecular Cell 38 (4): 576–89. https://pubmed.ncbi.nlm.nih.gov/20513432/.

Ramı́rez, Fidel, Devon P Ryan, Bjørn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S Richter, Steffen Heyne, et al. 2016. “deepTools2: A Next Generation Web Server for Deep-Sequencing Data Analysis.” Nucleic Acids Research 44 (W1): W160–65.

Schep, Alicia N, Bin Wu, Jason D Buenrostro, and William J Greenleaf. 2017. “chromVAR: Inferring Transcription-Factor-Associated Accessibility from Single-Cell Epigenomic Data.” Nature Methods 14 (10): 975–78. https://www.nature.com/articles/nmeth.4401.

Zhang, Yong, Tao Liu, Clifford A Meyer, Jie Eeckhoute, David S Johnson, Bradley E Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biology 9 (9): R137. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137.