A systematic evaluation of single-cell RNA-sequencing imputation methods

The rapid development of single-cell RNA-sequencing (scRNA-seq) technology, which yields sparser data than bulk RNA-sequencing (RNA-seq), has led to the emergence of many preprocessing methods, including imputation methods. Here, we systematically evaluate the performance of 18 state-of-the-art scRNA-seq imputation methods using cell line and tissue data measured across experimental protocols. Specifically, we assess the similarity of imputed cell profiles to bulk samples and investigate whether methods recover relevant biological signals or introduce spurious noise in three downstream analyses: differential expression, unsupervised clustering, and pseudotemporal trajectory inference. Broadly, we find significant variability in the performance of the methods across evaluation settings. While most scRNA-seq imputation methods recover biological expression observed in bulk RNA-seq data, the majority of the methods do not improve performance in downstream analyses compared to no imputation, in particular for clustering and trajectory analysis, and thus should be used with caution. Furthermore, we find that the performance of scRNA-seq imputation methods depends on many factors, including the experimental protocol, the sparsity of the data, the number of cells in the dataset, and the magnitude of the effect sizes. We summarize our results and provide a key set of recommendations for users and investigators to navigate the current space of scRNA-seq imputation methods.
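As a rough illustration of the first evaluation idea (similarity of imputed cell profiles to bulk samples), the sketch below averages cells into a per-gene profile and compares it to a bulk profile with a rank correlation. The choice of Spearman correlation, the averaging step, the function names, and the toy numbers are all assumptions made for illustration; the abstract does not specify the exact metric or procedure.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_profile(cells):
    """Average expression per gene (rows are cells, columns are genes)."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]


# Hypothetical toy data: 5 genes, 2 cells, one matched bulk profile.
bulk = [5.0, 0.5, 3.0, 8.0, 1.0]
raw_cells = [[4, 0, 0, 7, 0], [6, 0, 2, 9, 0]]
imputed_cells = [[4.2, 0.3, 2.5, 7.1, 0.8], [5.9, 0.4, 2.8, 9.0, 1.1]]

raw_score = spearman(mean_profile(raw_cells), bulk)
imputed_score = spearman(mean_profile(imputed_cells), bulk)
print(f"raw vs bulk:     {raw_score:.3f}")
print(f"imputed vs bulk: {imputed_score:.3f}")
```

In this toy case the zeros in the raw counts create ties that lower the rank correlation with bulk, while the imputed values restore the gene ordering. Agreement with bulk alone does not show that downstream clustering or trajectory inference improves, which is the caution the abstract raises.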
