SimCD: Simultaneous Clustering and Differential expression analysis for single-cell transcriptomic data

Single-cell RNA sequencing (scRNA-seq) measurements have facilitated genome-scale transcriptomic profiling of individual cells, with the hope of deconvolving dynamic cellular changes in the corresponding cell sub-populations to better understand the molecular mechanisms of different developmental processes. Several scRNA-seq analysis methods have been proposed that first identify cell sub-populations by clustering and then separately perform differential expression analysis to understand gene expression changes. Their corresponding statistical models and inference algorithms are often designed disjointly. We develop a new method, SimCD, that explicitly models cell heterogeneity and dynamic differential changes in one unified hierarchical gamma-negative binomial (hGNB) model, allowing simultaneous cell clustering and differential expression analysis for scRNA-seq data. Our method naturally defines cell heterogeneity by dynamic expression changes, which is expected to help achieve better performance on the two tasks than existing methods that perform them separately. In addition, SimCD better models dropout (zero inflation) in scRNA-seq data through both cell- and gene-level factors, and it obviates the need for sophisticated pre-processing steps such as normalization, thanks to the direct modeling of scRNA-seq count data by the rigorous hGNB model with an efficient Gibbs sampling inference algorithm. Extensive comparisons with state-of-the-art methods on both simulated and real-world scRNA-seq count data demonstrate the capability of SimCD to discover cell clusters and capture dynamic expression changes. Furthermore, SimCD helps identify several known genes affected by food deprivation in hypothalamic neuron cell subtypes, as well as some new potential markers, suggesting the capability of SimCD for biomarker discovery. SimCD is implemented in R and is available at https://github.com/namini94/SimCD.

arXiv:2104.01512v1 [q-bio.GN] 4 Apr 2021. A preprint, April 6, 2021.
