Single-cell RNA-seq denoising using a deep count autoencoder

Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at cellular resolution. However, noise due to amplification and dropout may obstruct analyses, so scalable denoising methods are needed for increasingly large but sparse scRNA-seq data. We propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA accounts for the count distribution, overdispersion and sparsity of the data using a zero-inflated negative binomial noise model, and captures nonlinear gene-gene or gene-dispersion interactions. Our method scales linearly with the number of cells and can therefore be applied to datasets of millions of cells. Using simulated and real datasets, we demonstrate that DCA denoising improves a diverse set of typical scRNA-seq data analyses. DCA outperforms existing methods for data imputation in quality and speed, enhancing biological discovery.
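
To make the zero-inflated negative binomial (ZINB) noise model concrete, the following is a minimal sketch of its log-likelihood in plain Python. It assumes the common mean/dispersion parameterization, with mean `mu`, dispersion `theta` and zero-inflation (dropout) probability `pi`. The function names are hypothetical and this is not taken from the DCA code base. In an autoencoder setting, the network would output these three parameters for each gene in each cell, and the summed negative log-likelihood would serve as the reconstruction loss; here the parameters are simply passed in as lists.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB distribution.

    pi (0 <= pi < 1) is the probability of an excess zero; with
    probability 1 - pi the count is drawn from the negative binomial.
    """
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        nb_zero = math.exp(nb_log_pmf(0, mu, theta))
        return math.log(pi + (1.0 - pi) * nb_zero)
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


def zinb_nll(counts, mus, thetas, pis):
    """Negative log-likelihood summed over observed counts, each with
    its own (mu, theta, pi) parameters."""
    return -sum(
        zinb_log_pmf(x, m, t, p)
        for x, m, t, p in zip(counts, mus, thetas, pis)
    )


if __name__ == "__main__":
    # Toy values for four genes in one cell (illustrative, not real data).
    counts = [0, 3, 0, 12]
    mus = [0.5, 2.0, 4.0, 10.0]
    thetas = [1.0, 1.5, 2.0, 5.0]
    pis = [0.3, 0.1, 0.6, 0.05]
    print(zinb_nll(counts, mus, thetas, pis))
```

Setting `pi` to zero recovers the ordinary negative binomial, so the same function can be used to compare the two noise models on a given dataset.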
