Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show that UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures, such as taking the log of counts per million, and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.
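
As an illustration of feature selection using deviance, the following is a minimal sketch rather than the authors' implementation. It assumes a per-gene binomial deviance against a null model in which each gene takes the same proportion of every cell's total UMI count; the abstract does not state the exact formula, so this choice, the function name `binomial_deviance`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-proportion null model.

    counts: array of shape (cells, genes) holding non-negative UMI counts.
    Returns an array of shape (genes,); larger values suggest a gene whose
    relative abundance varies more across cells than the null model allows.
    Assumes the matrix contains at least one non-zero count.
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum(axis=1, keepdims=True)             # total UMIs per cell
    pi = y.sum(axis=0, keepdims=True) / n.sum()  # null proportion per gene
    mu = n * pi                                  # expected counts under the null

    def a_log_a_over_b(a, b):
        # a * log(a / b), using the convention 0 * log(0) = 0
        out = np.zeros_like(a)
        mask = a > 0
        out[mask] = a[mask] * np.log(a[mask] / b[mask])
        return out

    term_gene = a_log_a_over_b(y, mu)            # counts for this gene
    term_rest = a_log_a_over_b(n - y, n - mu)    # counts for all other genes
    return 2.0 * (term_gene + term_rest).sum(axis=0)


# Hypothetical toy data: 4 cells x 3 genes. Genes 0 and 1 switch on and off
# between cells; gene 2 is a constant 90% of every cell's total.
toy_counts = np.array([
    [10, 0, 90],
    [0, 10, 90],
    [10, 0, 90],
    [0, 10, 90],
])
deviance = binomial_deviance(toy_counts)
ranked_genes = np.argsort(-deviance)  # most informative genes first
```

In this toy matrix the constant-proportion gene receives a deviance of zero, while the two genes that vary between cells rank first, which is the behaviour one would want from a feature selection score that works directly on raw counts.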
