Chapter 13 Single-cell RNA-seq

This chapter is in a beta stage. If you wish to contribute, please go to this form or our GitHub page.

13.1 Learning Objectives

This chapter will demonstrate how to:

  • Understand the basics of the single-cell RNA-seq data collection and processing workflow.
  • Identify the next steps for your particular single-cell RNA-seq data.
  • Formulate questions to ask about your single-cell RNA-seq data.

13.2 Where single-cell RNA-seq data comes from

Unlike bulk RNA-seq, which can only tell us about tissue-level and within-patient variation, single-cell RNA-seq can reveal cell-to-cell variation in the transcriptome, including intra-tumor heterogeneity.

Single-cell RNA-seq gives us cell-level transcriptional profiles, whereas bulk RNA-seq masks cell-to-cell heterogeneity. If your research questions require cell-level transcriptional information, single-cell RNA-seq will be of interest to you.

13.3 Single-cell RNA-seq data types

There are broadly two categories of single-cell RNA-seq data methods we will discuss.

  • Full-length RNA-seq: Individual cells are physically separated and then sequenced.
  • Tag-based RNA-seq: Individual cells are tagged with a barcode and their data is separated computationally.
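
To make "separated computationally" concrete, here is a minimal Python sketch of grouping reads by cell barcode. It assumes, purely for illustration, that the barcode is the first 4 bases of each read; real protocols use longer barcodes, often on a separate read, and dedicated tools handle this step. The function name and toy reads are hypothetical.

```python
from collections import defaultdict

# Assumption for illustration only: the first 4 bases of each read are the
# cell barcode and the remainder is the cDNA sequence.
BARCODE_LENGTH = 4


def split_reads_by_barcode(reads, barcode_length=BARCODE_LENGTH):
    """Group read sequences by the cell barcode at the start of each read."""
    reads_by_cell = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        cdna = read[barcode_length:]
        reads_by_cell[barcode].append(cdna)
    return dict(reads_by_cell)


# Toy reads from two cells (barcodes AAAA and CCCC).
toy_reads = ["AAAAGATTACA", "CCCCTTGACCA", "AAAACCGGTTA"]
reads_by_cell = split_reads_by_barcode(toy_reads)
```

In this toy example, the two reads that start with `AAAA` end up in one group and the read that starts with `CCCC` in another, even though all three were sequenced together.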

Depending on your goals for your single cell RNA-seq analysis, you may want to choose one method over the other.

  • Full-length single-cell RNA-seq:
    • Pros: Can use paired-end sequencing, which has less 3' bias. More complete coverage of transcripts, which may be better for transcript discovery purposes.
    • Cons: Is not very efficient (96 wells per plate). Takes longer to run (days to weeks, depending on the sample size). Expensive.
  • Tag-based single-cell RNA-seq:
    • Pros: Can profile up to millions of cells. Takes less computing power. File storage requirements are smaller. Much less expensive.
    • Cons: More intense 3' bias. Coverage is not as deep as full-length single-cell RNA-seq.

(Material borrowed from (“Alex’s Lemonade Training Modules” 2022)).

13.3.1 Unique Molecular identifiers

Tag-based single-cell RNA-seq methods often include not only a cell barcode for identifying the cell but also a unique molecular identifier (UMI) for identifying the original molecule. UMIs give insight into the original snapshot of the cell and can help combat PCR amplification bias: reads that share the same cell barcode, gene, and UMI are assumed to be PCR copies of a single original molecule, so they are counted once.
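
The counting idea behind UMIs can be sketched in a few lines of Python. This is a simplified illustration, not how any particular tool implements it: it assumes each read has already been assigned a cell barcode, a UMI, and a gene, and it ignores sequencing errors within UMIs. The function name and toy reads are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Count unique UMIs per (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples.
    Reads with the same barcode, gene, and UMI are treated as PCR
    duplicates of one original molecule and counted once.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


toy_reads = [
    ("AAAA", "GGT", "GeneA"),
    ("AAAA", "GGT", "GeneA"),  # same barcode, gene, and UMI: PCR duplicate
    ("AAAA", "GGT", "GeneA"),  # another duplicate
    ("AAAA", "TCA", "GeneA"),  # different UMI: a second original molecule
    ("CCCC", "GGT", "GeneA"),  # different cell
]
umi_counts = count_umis(toy_reads)
```

Here cell `AAAA` has four reads for `GeneA` but only two distinct UMIs, so its count is 2 rather than 4.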

13.4 Single cell RNA-seq tools

There are a lot of scRNA-seq tools for various steps along the way.

In a very general sense, single-cell RNA-seq workflows start with quantification/alignment. You will then need to conduct quality control steps, which may involve using UMIs to check what is detected, detecting doublets, and using this information to filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize it. Single-cell data is highly skewed: many genes are barely detected or not detected at all, and a few genes are detected a lot. After the data has been normalized, you are ready to conduct your downstream analyses. These will depend heavily on the original goals and questions of your experiment, and may include dimension reduction, cell classification, differential expression, detecting cell trajectories, or any number of other analyses.
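
As a rough illustration of the filtering and normalization steps, the Python sketch below drops cells with very few total counts and then applies a library-size normalization followed by a log transform. The counts, the threshold, and the scale factor are all made-up assumptions for this example; the tools discussed in this chapter offer their own, more sophisticated methods.

```python
import math

# Toy counts: one entry per cell, one value per gene (hypothetical numbers).
# Note the skew: most values are 0 or tiny, and one gene dominates.
counts = {
    "cell_1": [0, 1, 0, 98, 1],
    "cell_2": [0, 0, 2, 45, 3],
    "cell_3": [0, 0, 0, 1, 0],  # very few counts: likely low quality
}

MIN_TOTAL_COUNTS = 10  # hypothetical threshold; choose one from your own data
SCALE_FACTOR = 10_000  # assumed scale factor for this sketch


def filter_cells(counts, min_total=MIN_TOTAL_COUNTS):
    """Keep only cells whose total count reaches the threshold."""
    return {cell: row for cell, row in counts.items() if sum(row) >= min_total}


def log_normalize(counts, scale_factor=SCALE_FACTOR):
    """Scale each cell to the same total, then apply log(1 + x)."""
    normalized = {}
    for cell, row in counts.items():
        total = sum(row)
        normalized[cell] = [
            math.log1p(value / total * scale_factor) for value in row
        ]
    return normalized


kept = filter_cells(counts)
normalized = log_normalize(kept)
```

The log transform compresses the few very large values while leaving zeros at zero, which is one simple way to reduce the skew described above.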

Each step of this very general workflow can be conducted by a variety of tools. We will highlight some of the more popular tools here; for a full list, you can consult the scRNA-tools website.

13.5 Quantification and alignment tools

The following pros and cons sections were written by AI and may need verification by experts. They are meant to give you a basic idea of the pros and cons of these tools, but should ultimately be used with your own judgment.

  • STAR (Dobin et al. 2013):
    • Pros: Accurate alignment of RNA-seq reads to the genome. Can handle a wide range of RNA-seq protocols, including scRNA-seq. Provides read counts and gene-level expression values.
    • Cons: Requires a significant amount of memory and computational resources. May be difficult to set up and run for beginners.
  • HISAT2 (Kim, Langmead, and Salzberg 2015):
    • Pros: Accurate, splice-aware alignment of RNA-seq reads to the genome, with low memory requirements.
    • Cons: Produces alignments only, so expression values require a separate quantification step. May not be as accurate as some other alignment tools.
  • Kallisto bustools (Bray et al. 2016):
    • Pros: Fast and accurate quantification of RNA-seq reads without the need for alignment. Provides transcript-level expression values. Requires less memory and computational resources than alignment-based methods.
    • Cons: May not be as accurate as alignment-based methods for lowly expressed genes. Cannot provide allele-specific expression estimates.

  • Alevin/Salmon (Patro et al. 2017):
    • Pros: Fast and accurate quantification of RNA-seq reads without the need for alignment. Provides transcript-level expression values. Supports both single-end and paired-end sequencing.
    • Cons: May not be as accurate as alignment-based methods for lowly expressed genes. Cannot provide allele-specific expression estimates.

  • Cell Ranger (Zheng et al. 2017):
    • Pros: Specifically designed for 10x Genomics scRNA-seq data, with optimized workflows for alignment and quantification. Provides read counts and gene-level expression values. Offers a streamlined pipeline with minimal input from the user.
    • Cons: Limited options for customizing parameters or analysis methods. May not be suitable for datasets from other scRNA-seq platforms.

13.6 Downstream tools Pros and Cons

  • Seurat:
    • Pros: Has a wide range of functionalities for preprocessing, clustering, differential expression, and visualization. Can handle multiple modalities, including CITE-seq and ATAC-seq. Has a large and active user community, with extensive documentation and tutorials available.
    • Cons: Can be computationally intensive, especially for large datasets. Requires some knowledge of R programming language.
  • Scanpy:
    • Pros: Written in Python, a widely used programming language in bioinformatics. Has a user-friendly interface and extensive documentation. Offers a variety of preprocessing, clustering, and differential expression methods, as well as interactive visualizations.
    • Cons: May not be as feature-rich as some other tools, such as Seurat. Does not yet support multiple modalities.
  • Monocle:
    • Pros: Focuses on trajectory analysis, allowing users to explore developmental trajectories and cell fate decisions. Has a user-friendly interface and extensive documentation. Can handle data from multiple platforms, including Smart-seq2 and Drop-seq.
    • Cons: May not be as feature-rich for clustering or differential expression analysis as some other tools. Requires some knowledge of R programming language.

13.6.1 Doublet Tool Pros and Cons

  • DoubletFinder (McGinnis, Murrow, and Gartner 2020):
    • Pros: Uses a machine learning approach to detect doublets based on transcriptome similarity. Can be used with a variety of scRNA-seq platforms. Offers a user-friendly interface and extensive documentation.
    • Cons: Can be computationally intensive for large datasets. May require some knowledge of R programming language.
  • Scrublet (Wolock, Krishnaswamy, and Huang 2019):
    • Pros: Simulates doublets from the observed data and uses a nearest-neighbor classifier to score how likely each cell is to be a doublet. Fast and computationally efficient, making it suitable for large datasets. Offers a user-friendly interface and extensive documentation.
    • Cons: May not be as accurate as other methods, especially for low-quality data. Requires some knowledge of the Python programming language.
  • DoubletDecon (De Pasquale and Dudoit 2019):
    • Pros: Uses a statistical approach to identify doublets based on the distribution of the number of unique molecular identifiers (UMIs) per cell. Can be used with different platforms and species. Offers a user-friendly interface and extensive documentation.
    • Cons: May not be as accurate as other methods, especially for data with low sequencing depth or low cell numbers. Requires some knowledge of R programming language.

It’s important to note that no doublet detection method is perfect, and it’s often a good idea to combine multiple methods to increase the accuracy of doublet identification. Additionally, manual inspection of the data is always recommended to confirm the presence or absence of doublets.
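
One simple way to combine multiple methods is a majority vote over their doublet calls, sketched below in Python. This is only an illustration: the method names are placeholders rather than real tool outputs, and the vote threshold is an assumption you would need to choose for your own data.

```python
def consensus_doublets(calls_by_method, min_votes=2):
    """Return the cells flagged as doublets by at least `min_votes` methods.

    `calls_by_method` maps a method name to the set of cell identifiers
    that method flagged as doublets.
    """
    votes = {}
    for flagged_cells in calls_by_method.values():
        for cell in flagged_cells:
            votes[cell] = votes.get(cell, 0) + 1
    return {cell for cell, n_votes in votes.items() if n_votes >= min_votes}


# Hypothetical calls from three doublet detection methods.
calls_by_method = {
    "method_a": {"cell_2", "cell_5", "cell_9"},
    "method_b": {"cell_5", "cell_9"},
    "method_c": {"cell_5", "cell_7"},
}
likely_doublets = consensus_doublets(calls_by_method)
```

Cells flagged by only one method (here `cell_2` and `cell_7`) are good candidates for the manual inspection mentioned above rather than automatic removal.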

13.7 More scRNA-seq tools and tutorials

13.8 Visualization GUI tools

  • WebMeV provides a user-friendly, intuitive, interactive interface to processed analytical data, uses cloud-computing elasticity for computationally intensive analyses, and is compatible with single-cell or bulk RNA-seq input data.
  • UCSC Xena is a web-based visualization tool for multi-omic data and associated clinical and phenotypic annotations. It can be used with single cell RNA-seq data.
  • Integrative Genomics Viewer (IGV) is a track-based browser for interactively exploring genomic data mapped to a reference genome.

13.9 Useful tutorials

These tutorials cover explicit steps, code, tool recommendations and other considerations for analyzing RNA-seq data.

References

“Alex’s Lemonade Training Modules.” 2022. https://github.com/AlexsLemonade/training-modules.
Angerer, Philipp, Lukas Simon, Sophie Tritschler, F. Alexander Wolf, David Fischer, and Fabian J. Theis. 2017. “Single Cells Make Big Data: New Challenges and Opportunities in Transcriptomics.” Current Opinion in Systems Biology 4 (August): 85–91. https://doi.org/10.1016/j.coisb.2017.07.004.
Baran-Gale, Jeanette, Tamir Chandra, and Kristina Kirschner. 2018. “Experimental Design for Single-Cell RNA Sequencing.” Briefings in Functional Genomics 17 (4): 233–39. https://doi.org/10.1093/bfgp/elx035.
Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nature Biotechnology 34 (5): 525–27. https://www.nature.com/articles/nbt.3519.
Brüning, Ralf Schulze, Lukas Tombor, Marcel H. Schulz, Stefanie Dimmeler, and David John. 2021. “Comparative Analysis of Common Alignment Tools for Single Cell RNA Sequencing.” bioRxiv. https://doi.org/10.1101/2021.02.15.430948.
De Pasquale, Elisa, and Sandrine Dudoit. 2019. “DoubletDecon: Deconvoluting Doublets from Single-Cell RNA-Sequencing Data.” Cell Reports 29 (6): 1718–1727. e8. https://www.sciencedirect.com/science/article/pii/S2211124719312860.
Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. https://academic.oup.com/bioinformatics/article/29/1/15/272537.
Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nature Methods 12 (4): 357–60. https://www.nature.com/articles/nmeth.3317.
Luecken, Malte D, and Fabian J Theis. 2019. “Current Best Practices in Single‐cell RNA‐seq Analysis: A Tutorial.” Molecular Systems Biology 15 (6). https://doi.org/10.15252/msb.20188746.
McGinnis, Christopher S, Lillian M Murrow, and Zev J Gartner. 2020. “DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors.” Cell Systems 8 (4): 329–337. e4. https://pubmed.ncbi.nlm.nih.gov/30954475/.
Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (4): 417–19. https://pubmed.ncbi.nlm.nih.gov/28263959/.
Smith, Tom. 2015. “Unique Molecular Identifiers – the Problem, the Solution and the Proof.” CGAT. https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/.
Wolock, Samuel L, Smita Krishnaswamy, and B Jesse Huang. 2019. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8 (4): 281–291. e9. https://pubmed.ncbi.nlm.nih.gov/30954476/.
Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (1): 1–12. https://www.nature.com/articles/ncomms14049.