bayNorm: Bayesian gene expression recovery, imputation and normalization for single-cell RNA-sequencing data

Abstract Motivation Normalization of single-cell RNA-sequencing (scRNA-seq) data is a prerequisite to their interpretation. The marked technical variability, high amounts of missing observations and batch effect typical of scRNA-seq datasets make this task particularly challenging. There is a need for an efficient and unified approach for normalization, imputation and batch effect correction. Results Here, we introduce bayNorm, a novel Bayesian approach for scaling and inference of scRNA-seq counts. The method’s likelihood function follows a binomial model of mRNA capture, while priors are estimated from expression values across cells using an empirical Bayes approach. We first validate our assumptions by showing this model can reproduce different statistics observed in real scRNA-seq data. We demonstrate using publicly available scRNA-seq datasets and simulated expression data that bayNorm allows robust imputation of missing values generating realistic transcript distributions that match single molecule fluorescence in situ hybridization measurements. Moreover, by using priors informed by dataset structures, bayNorm improves accuracy and sensitivity of differential expression analysis and reduces batch effect compared with other existing methods. Altogether, bayNorm provides an efficient, integrated solution for global scaling normalization, imputation and true count recovery of gene expression measurements from scRNA-seq data. Availability and implementation The R package ‘bayNorm’ is publishd on bioconductor at https://bioconductor.org/packages/release/bioc/html/bayNorm.html. The code for analyzing data in this article is available at https://github.com/WT215/bayNorm_papercode. Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[2]  Anne Condon,et al.  Interpretable dimensionality reduction of single cell transcriptome data with deep generative models , 2017, Nature Communications.

[3]  Chengzhong Ye,et al.  DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data , 2019, Bioinform..

[4]  R. Irizarry,et al.  Missing data and technical variability in single‐cell RNA‐sequencing experiments , 2018, Biostatistics.

[5]  Maria K. Jaakkola,et al.  Comparison of methods to detect differentially expressed genes between single-cell populations , 2016, Briefings Bioinform..

[6]  L.L. Elo,et al.  Reproducibility-Optimized Test Statistic for Ranking Genes in Microarray Studies , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[7]  Jin Gu,et al.  VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder , 2018, Genom. Proteom. Bioinform..

[8]  Laleh Haghverdi,et al.  Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors , 2018, Nature Biotechnology.

[9]  S. Linnarsson,et al.  Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq , 2015, Science.

[10]  David B. Dunson,et al.  Bayesian Data Analysis , 2010 .

[11]  Kevin R. Moon,et al.  Recovering Gene Interactions from Single-Cell Data Using Data Diffusion , 2018, Cell.

[12]  A. Heger,et al.  UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy , 2016, bioRxiv.

[13]  Michael I. Jordan,et al.  Deep Generative Modeling for Single-cell Transcriptomics , 2018, Nature Methods.

[14]  Christoph Ziegenhain,et al.  zUMIs - A fast and flexible pipeline to process RNA sequencing data with UMIs , 2017, bioRxiv.

[15]  S. Teichmann,et al.  From Tissues to Cell Types and Back: Single-Cell Gene Expression Analysis of Tissue Architecture , 2018, Annual Review of Biomedical Data Science.

[16]  Rhonda Bacher,et al.  Design and computational analysis of single-cell RNA-sequencing experiments , 2016, Genome Biology.

[17]  Aaron Lun,et al.  Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data , 2018, bioRxiv.

[18]  Jean YH Yang,et al.  Bioconductor: open software development for computational biology and bioinformatics , 2004, Genome Biology.

[19]  V. Shahrezaei,et al.  Single-cell imaging and RNA sequencing reveal patterns of gene expression heterogeneity during fission yeast growth and adaptation , 2019, Nature Microbiology.

[20]  Fabian J Theis,et al.  Single-cell RNA-seq denoising using a deep count autoencoder , 2019, Nature Communications.

[21]  Mark D. Robinson,et al.  Moderated statistical tests for assessing differences in tag abundance , 2007, Bioinform..

[22]  M. Newton,et al.  SCnorm: robust normalization of single-cell RNA-seq data , 2017, Nature Methods.

[23]  Nancy R. Zhang,et al.  SAVER: Gene expression recovery for single-cell RNA sequencing , 2018, Nature Methods.

[24]  Charlotte Soneson,et al.  Bias, robustness and scalability in single-cell differential expression analysis , 2018, Nature Methods.

[25]  Lior Pachter,et al.  Differential analysis of RNA-seq incorporating quantification uncertainty , 2016, Nature Methods.

[26]  P. Linsley,et al.  MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data , 2015, Genome Biology.

[27]  P. Kharchenko,et al.  Bayesian approach to single-cell differential expression analysis , 2014, Nature Methods.

[28]  Casper Kaae Sønderby,et al.  scVAE: variational auto-encoders for single-cell gene expression data , 2020, Bioinform..

[29]  Sandhya Prabhakaran,et al.  Dirichlet Process Mixture Model for Correcting Technical Variation in Single-Cell Gene Expression Data , 2016, ICML.

[30]  Ambrose J. Carr,et al.  Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment , 2018, Cell.

[31]  Wolfgang Huber,et al.  Differential expression of RNA-Seq data at the gene level – the DESeq package , 2012 .

[32]  F. W. Townes,et al.  Missing Data and Technical Variability in Single-Cell RNA-Sequencing Experiments , 2017, bioRxiv.

[33]  J. Marioni,et al.  Pooling across cells to normalize single-cell RNA sequencing data with many zero counts , 2016, Genome Biology.

[34]  Allon M. Klein,et al.  Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells , 2015, Cell.

[35]  Aleksandra A. Kolodziejczyk,et al.  Accounting for technical noise in single-cell RNA-seq experiments , 2013, Nature Methods.

[36]  D. Cacchiarelli,et al.  Characterization of directed differentiation by high-throughput single-cell RNA-Seq , 2014, bioRxiv.

[37]  Sydney M. Shaffer,et al.  Rare Cell Detection by Single-Cell RNA Sequencing as Guided by Single-Molecule RNA FISH. , 2018, Cell systems.

[38]  Catalina A. Vallejos,et al.  BASiCS: Bayesian Analysis of Single-Cell Sequencing Data , 2015, PLoS Comput. Biol..

[39]  Sandrine Dudoit,et al.  Normalizing single-cell RNA sequencing data: challenges and opportunities , 2017, Nature Methods.

[40]  Florian Wagner,et al.  K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data , 2017, bioRxiv.

[41]  Thomas J. Hardcastle,et al.  baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data , 2010, BMC Bioinformatics.

[42]  A. Oshlack,et al.  Splatter: simulation of single-cell RNA sequencing data , 2017, Genome Biology.

[43]  Christoph Ziegenhain,et al.  Quantitative single-cell transcriptomics , 2018, Briefings in functional genomics.

[44]  Chengzhong Ye,et al.  DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data , 2017, bioRxiv.

[45]  D. Tranchina,et al.  Stochastic mRNA Synthesis in Mammalian Cells , 2006, PLoS biology.

[46]  David A. Knowles,et al.  Batch effects and the effective design of single-cell gene expression studies , 2016, Scientific Reports.

[47]  Valentine Svensson,et al.  Droplet scRNA-seq is not zero-inflated , 2019, Nature Biotechnology.

[48]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[49]  Paul Hoffman,et al.  Integrating single-cell transcriptomic data across different conditions, technologies, and species , 2018, Nature Biotechnology.

[50]  E. Pierson,et al.  ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis , 2015, Genome Biology.

[51]  Mark D. Robinson,et al.  Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data , 2017, bioRxiv.

[52]  Vahid Shahrezaei,et al.  Analytical distributions for stochastic gene expression , 2008, Proceedings of the National Academy of Sciences.

[53]  Wei Vivian Li,et al.  An accurate and robust imputation method scImpute for single-cell RNA-seq data , 2018, Nature Communications.

[54]  Yi Zhang,et al.  Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity. , 2017, Cell reports.

[55]  M. Robinson,et al.  A scaling normalization method for differential expression analysis of RNA-seq data , 2010, Genome Biology.

[56]  Joseph G Ibrahim,et al.  Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences , 2018, bioRxiv.

[57]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[58]  James Hicks,et al.  Unravelling biology and shifting paradigms in cancer with single-cell sequencing , 2017, Nature Reviews Cancer.

[59]  Samuel L. Wolock,et al.  A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. , 2016, Cell systems.

[60]  M. Hemberg,et al.  scmap: projection of single-cell RNA-seq data across data sets , 2018, Nature Methods.

[61]  S. Linnarsson,et al.  Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. , 2011, Genome research.

[62]  Nir Yosef,et al.  Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing , 2018, bioRxiv.