Pathway-Based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests

Single-cell RNA sequencing (scRNA-Seq) enables biomedical researchers to characterize cell-specific gene expression profiles. Although several studies have adapted machine learning algorithms to cluster cell populations in scRNA-Seq data, few existing methods use machine learning to investigate the role of functional pathways in classifying heterogeneous cell populations. Because genes often act together at the pathway level, studying cellular heterogeneity in terms of pathways can facilitate the biological interpretation of different cell populations. In this paper, we propose a pathway-based analytic framework that uses Random Forests (RF) to identify discriminative functional pathways related to cellular heterogeneity and to cluster cell populations in scRNA-Seq data. We further propose a novel method that uses RF to construct gene-gene interaction (GGI) networks, which illustrate the GGIs that are important in differentiating cell populations. The networks also show the co-occurrence of genes in different discriminative pathways and the ‘cross-talk’ genes that connect those pathways. Our pathway-based framework clusters cell populations, prioritizes important pathways, highlights GGIs and pivotal genes that bridge cross-talking pathways, and groups co-functional genes in networks. These features allow biomedical researchers to better understand the functional heterogeneity of different cell populations and to pinpoint important genes driving heterogeneous cellular functions.
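
The general idea of ranking pathways by how well an RF separates cell populations, and of deriving a cell-cell similarity from the fitted forest, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the expression matrix is synthetic, the pathway names and gene memberships are hypothetical, out-of-bag accuracy is assumed as the ranking criterion, and scikit-learn stands in for whatever software the paper actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for an scRNA-Seq matrix: 60 cells x 20 genes, two populations.
n_cells, n_genes = 60, 20
labels = np.repeat([0, 1], n_cells // 2)
expr = rng.normal(size=(n_cells, n_genes))
expr[labels == 1, :5] += 3.0  # genes 0-4 separate the two populations

# Hypothetical pathway membership (gene column indices), for illustration only.
pathways = {
    "pathway_A": [0, 1, 2, 3, 4],       # contains the informative genes
    "pathway_B": [10, 11, 12, 13, 14],  # pure noise
}

def fit_pathway_forest(genes):
    """Fit an RF on the genes of one pathway, keeping the out-of-bag estimate."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(expr[:, genes], labels)
    return rf

forests = {name: fit_pathway_forest(genes) for name, genes in pathways.items()}

# Rank pathways by out-of-bag accuracy (assumed criterion).
oob_accuracy = {name: rf.oob_score_ for name, rf in forests.items()}
ranked = sorted(oob_accuracy, key=oob_accuracy.get, reverse=True)

def rf_proximity(rf, X):
    """Fraction of trees in which each pair of cells lands in the same leaf."""
    leaves = rf.apply(X)  # shape: (n_cells, n_trees)
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

best = ranked[0]
proximity = rf_proximity(forests[best], expr[:, pathways[best]])
print(ranked, proximity.shape)
```

In a sketch like this, `1 - proximity` could serve as a dissimilarity matrix for a standard clustering routine; real pathway gene sets would come from a curated pathway database rather than hard-coded indices.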
