To the Editor — Potential users of single-cell RNA-sequencing (scRNA-seq)1 often encounter a choice between highthroughput droplet-based methods and high-sensitivity plate-based methods. There is a widespread belief that scRNA-seq will often fail to generate measurements for some genes from some cells owing to technical molecular inefficiencies. It is believed that this causes data to have an overabundance of zero values compared to what is expected from random sampling and that this effect is particularly pronounced in droplet-based methods. Here I present an investigation of published data for technical controls in droplet-based scRNA-seq experiments that demonstrates that the number of zero values in the data is consistent with common distributional models of molecule sampling counts. Thus, any additional zero values in biological data likely result from biological variation or may reflect variation in gene abundance among cell types or cell states. As scRNA-seq started to gain popularity, users expressed concern about the number of zero values among gene expression measures. That is, for any given gene, several cells showed no expression, even if expression was relatively high in other cells2–4. These ‘technical zeros’ are considered distinct from ‘biological zeros’, which are zero because the gene is not expressed5. The original reference to this ‘dropout’ problem of yielding an inflation of zero values in scRNA-seq data was in the description of single-cell differential expression (SCDE)6. It was reported that genes in scRNA-seq data generated with the low-throughput plate-based methods single-cell tagged reverse transcription (STRT)-seq and Smart-seq resulted in more zero values when comparing two single-cell samples than was typically observed when comparing two traditional bulk RNAseq samples. To solve the task of reliably finding shifts in mean between conditions, a mixture model was used that could model the unexpected zeros. There is no clear record in the literature from where the idea of ‘zero inflation’ in droplet-based data arose. The conceptual origin seems to have been the combination of the model in SCDE attempting to model counts in scRNA-seq data as well as expectations based on intuition about continuous distributions. Over time, several statistical tests, such as model-based analysis of single-cell transcriptomics (MAST)7 and eventually dimensionality reduction methods with special handling of zero values had been introduced8,9. All these approaches share two common themes (which deviate from SCDE): first, expression data are considered continuous with additional zero values, and second, a proportional relation is identified between the number of zero values and the average expression level of a gene10. In the field of computational methods for scRNA-seq analysis, many methods have been designed to correct zero values in data, with the aim of allowing users to predict what the expression level of a gene in a cell would have been, had there been no zero-inflation or dropouts11–17, referred to as ‘imputation’. High-throughput droplet-based scRNAseq methods measure discrete counts of unique molecular identifiers (UMIs). The stochastic sampling of counts can be modeled using a Gamma-Poisson distribution, more colloquially known as the ‘negative binomial’ distribution. Under stochastic sampling of counts, zero values are expected. When researchers discuss ‘zero-inflation’ or ‘dropouts’, they refer to observing more ‘technical zeros’ than expected, which would pose challenges when estimating molecular abundance. This is not a concern with the expected ‘sampling zeros’5. It has recently become popular to make statistical models for scRNA-seq counts using the ‘zero-inflated negative binomial’ distribution18–20. This distribution adds a probability of observing a zero value in any given draw from the distribution of each gene. Closer investigations seem to counter the assumption that zero inflation is an inherent property of scRNA-seq data. A methods paper on simulations for scRNA-seq data reported negative binomial to be sufficient for UMI count data, with little added benefit from a zero-inflation component21. Similarly, the authors of the method bayNorm found that zero-inflation was not necessary16. One group proposed that genes with higher fractions of zero values than suggested by the negative binomial distribution might be good candidates for further analysis because this seems to reflect biological variation22. Finally, another investigation suggested that what appears as zero inflation is an effect of log transformation of the counts, and factor models aware of count distributions alleviate the problem23. To investigate the number of zero values in droplet-based high-throughput scRNA-seq platforms, it is possible to use negative control data with no biological variation. This will answer whether technical shortcomings in scRNA-seq methods produce an excess of technical zeros compared to expectations. Negative-control datasets have been generated by adding a solution of RNA to the fluid in microfluidic systems, making the RNA content in each droplet identical. Five such datasets have been published: one to benchmark Drop-seq24, one to benchmark InDrops25, one to benchmark an early version of the commercial scRNAseq platform from 10x Genomics26 and two to benchmark a later version of the commercial platform from 10x Genomics27 (Supplementary Note 1). The negative control data differ in the RNA solution added: bulk RNA from cells or tissues or artificial External RNA Control Consortium (ERCC) RNA (Table 1). For simplicity, here I refer to each RNA species, regardless of source, as a ‘gene’. All negative control experiments aimed to make an RNA dilution that would fill each droplet with an RNA amount similar to that in a typical mammalian cell. The negative binomial distribution has two parameters: a ‘mean’ (sometimes called ‘rate’) parameter μ and a ‘dispersion’ parameter φ. Historically, the negative binomial distribution has been used to model data in which there is unknown random variation in the exposure compared with a Poisson distribution28, where the dispersion φ determines the increased amount of variation. For negative control data, it is reasonable to expect φ to be common for all genes as they are all affected by the same sampling process. In bulk RNA-seq data, gene-wise φ parameters are often used to account for biological variation in gene expression between samples29. A negative binomial distribution with mean μ and dispersion φ will have the probability of observing a count of 0 determined by
[1]
Christoph Ziegenhain,et al.
powsimR: Power analysis for bulk and single cell RNA-seq experiments
,
2017,
bioRxiv.
[2]
Valentine Svensson,et al.
Power Analysis of Single Cell RNA-Sequencing Experiments
,
2016,
Nature Methods.
[3]
F. W. Townes,et al.
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model
,
2019,
Genome Biology.
[4]
David A. Knowles,et al.
Batch effects and the effective design of single-cell gene expression studies
,
2016,
Scientific Reports.
[5]
S. Dudoit,et al.
A general and flexible method for signal extraction from single-cell RNA-seq data
,
2018,
Nature Communications.
[6]
Joshua W. K. Ho,et al.
CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data
,
2016,
Genome Biology.
[7]
Ambrose J. Carr,et al.
Bayesian Inference for Single-cell Clustering and Imputing
,
2017
.
[8]
Kathryn Roeder,et al.
A United Statistical Framework for Single Cell and Bulk Sequencing Data
,
2016,
bioRxiv.
[9]
Wenhao Tang,et al.
bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data
,
2019,
Bioinform..
[10]
Nancy R. Zhang,et al.
SAVER: Gene expression recovery for single-cell RNA sequencing
,
2018,
Nature Methods.
[11]
Michael I. Jordan,et al.
Deep Generative Modeling for Single-cell Transcriptomics
,
2018,
Nature Methods.
[12]
Il-Youp Kwak,et al.
DrImpute: imputing dropout events in single cell RNA sequencing data
,
2017,
BMC Bioinformatics.
[13]
A. Raj,et al.
Single mammalian cells compensate for differences in cellular volume and DNA copy number through independent global transcriptional mechanisms.
,
2015,
Molecular cell.
[14]
Kevin R. Moon,et al.
Recovering Gene Interactions from Single-Cell Data Using Data Diffusion
,
2018,
Cell.
[15]
Grace X. Y. Zheng,et al.
Massively parallel digital transcriptional profiling of single cells
,
2016,
Nature Communications.
[16]
S. Teichmann,et al.
From Tissues to Cell Types and Back: Single-Cell Gene Expression Analysis of Tissue Architecture
,
2018,
Annual Review of Biomedical Data Science.
[17]
Sayan Mukherjee,et al.
Naught all zeros in sequence count data are the same
,
2018,
bioRxiv.
[18]
Wei Vivian Li,et al.
An accurate and robust imputation method scImpute for single-cell RNA-seq data
,
2018,
Nature Communications.
[19]
Rhonda Bacher,et al.
Design and computational analysis of single-cell RNA-sequencing experiments
,
2016,
Genome Biology.
[20]
E. Pierson,et al.
ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis
,
2015,
Genome Biology.
[21]
Evan Z. Macosko,et al.
Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets
,
2015,
Cell.
[22]
Allon M. Klein,et al.
Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells
,
2015,
Cell.
[23]
Sandrine Dudoit,et al.
Normalizing single-cell RNA sequencing data: challenges and opportunities
,
2017,
Nature Methods.
[24]
Mark D. Robinson,et al.
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
,
2009,
Bioinform..
[25]
Fabian J Theis,et al.
Single-cell RNA-seq denoising using a deep count autoencoder
,
2019,
Nature Communications.
[26]
P. Kharchenko,et al.
Bayesian approach to single-cell differential expression analysis
,
2014,
Nature Methods.
[27]
S. Teichmann,et al.
A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications
,
2017,
Genome Medicine.
[28]
Martin Hemberg,et al.
M3Drop: dropout-based feature selection for scRNASeq
,
2018,
Bioinform..
[29]
P. Linsley,et al.
MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data
,
2015,
Genome Biology.