A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification

Motivation: Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When preprocessing raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data and the strong 3’ sampling bias make it difficult to disambiguate cases where no read maps uniquely to any of the candidate target genes.

Results: We introduce a Bayesian framework for sharing information across cells within a sample, or across multiple data modalities measured on the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene expression patterns and learn informative, empirical priors, which we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show that our new model improves the per-cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups.

Availability: The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.

Contact: asrivastava@cs.stonybrook.edu, rob@cs.umd.edu
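
To illustrate why a prior matters when a cell has no uniquely mapping reads, the following is a minimal sketch of an EM-style allocation of ambiguous counts with additive prior pseudo-counts. It is not alevin's implementation: the function name `resolve_multimapping`, the equivalence-class representation, the gene names, and the pseudo-count form of the prior are all assumptions made for illustration. In the framework described above, the prior would be learned from cells with similar expression patterns; here it is simply supplied by hand.

```python
def resolve_multimapping(eq_classes, prior, n_iter=50):
    """Allocate ambiguous counts to genes with an EM-style update.

    eq_classes: list of (genes, count) pairs, where `genes` is a tuple of
                gene names that a group of reads/UMIs is compatible with.
    prior:      dict mapping gene name -> non-negative pseudo-count
                (hypothetical stand-in for an empirical prior).
    Returns a dict mapping gene name -> expected count for this cell.
    """
    genes = {g for members, _ in eq_classes for g in members}
    alloc = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        # Current abundance weight = allocated counts + prior pseudo-counts.
        weight = {g: alloc[g] + prior.get(g, 0.0) for g in genes}
        new_alloc = {g: 0.0 for g in genes}
        for members, count in eq_classes:
            total = sum(weight[g] for g in members)
            for g in members:
                # With no evidence at all, fall back to a uniform split.
                share = weight[g] / total if total > 0 else 1.0 / len(members)
                new_alloc[g] += count * share
        alloc = new_alloc
    return alloc


# One cell with 10 counts compatible with both genes and no unique evidence.
eq_classes = [(("GENE_A", "GENE_B"), 10)]

# An uninformative prior cannot break the tie.
flat = resolve_multimapping(eq_classes, {"GENE_A": 1.0, "GENE_B": 1.0})

# A prior favouring GENE_A (e.g. borrowed from similar cells) shifts the split.
informed = resolve_multimapping(eq_classes, {"GENE_A": 9.0, "GENE_B": 1.0})
```

With the flat prior the ambiguous counts are split evenly between the two genes, whereas the informative prior moves most of them to `GENE_A`. This is the situation the abstract refers to: within the cell the data alone cannot distinguish the candidates, so any preference has to come from outside it.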
