Penalized Latent Dirichlet Allocation Model in Single-Cell RNA Sequencing

Single-cell RNA sequencing (scRNA-seq) quantifies RNA transcripts at the level of individual cells, providing cellular resolution of gene expression variation. scRNA-seq data are counts of RNA transcripts for all genes in a species' genome; they are very high-dimensional and contain an excess of zero counts. To reduce the data dimension and extract robust, interpretable biological information, we develop a penalized Latent Dirichlet Allocation (pLDA) model for scRNA-seq data. The method is adapted from LDA, a generative probabilistic model that originated in natural language processing. pLDA models scRNA-seq data by treating genes as words, cells as documents, and latent biological functions as topics. It imposes a penalty reflecting a characteristic of scRNA-seq data, namely that only a small subset of genes is expected to be topic-specific, which increases the robustness of the estimation and the interpretability of the results. We apply pLDA to scRNA-seq datasets from both Drop-seq and SMARTer v1 technologies and demonstrate improved performance in cell-type classification. The topics identified by pLDA can be interpreted in terms of biological functions.
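To make the gene/cell/topic analogy concrete, the following is a minimal sketch of the standard (unpenalized) LDA generative view applied to a cell-by-gene count matrix. It is not the pLDA estimation procedure: the abstract does not specify the form of the penalty, so none is implemented here. The function name `simulate_lda_counts`, the symmetric Dirichlet hyperparameters `alpha` and `eta`, and the fixed per-cell sequencing depth are all assumptions made for illustration.

```python
import numpy as np


def simulate_lda_counts(n_cells=50, n_genes=200, n_topics=3,
                        depth=1000, alpha=0.5, eta=0.5, seed=0):
    """Simulate a cell-by-gene count matrix under a plain LDA model.

    Hypothetical illustration: cells play the role of documents,
    genes the role of words, and latent functions the role of topics.
    """
    rng = np.random.default_rng(seed)

    # Topic-gene distributions: each row is one topic's distribution over genes.
    beta = rng.dirichlet(np.full(n_genes, eta), size=n_topics)    # (K, G)

    # Cell-topic proportions: each row is one cell's mixture over topics.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_cells)  # (C, K)

    # Per-cell gene probabilities are a topic-weighted mixture.
    gene_probs = theta @ beta                                      # (C, G)
    gene_probs /= gene_probs.sum(axis=1, keepdims=True)  # guard against rounding

    # Each cell's transcript counts are one multinomial draw of `depth` reads.
    counts = np.vstack([rng.multinomial(depth, p) for p in gene_probs])
    return counts, theta, beta


counts, theta, beta = simulate_lda_counts()
print(counts.shape)            # cells x genes
print((counts == 0).mean())    # fraction of zero entries
```

In this view, dimension reduction amounts to replacing each cell's long vector of gene counts with its short vector of topic proportions (`theta`), and a sparsity-inducing penalty of the kind the abstract describes would act on how strongly each gene's entries in `beta` differ across topics.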
