A Gene Rank Based Approach for Single Cell Similarity Assessment and Clustering

One of the current research directions for single-cell RNA sequencing (scRNA-seq) data is the accurate identification of cell types through unsupervised clustering. However, scRNA-seq data analysis is challenging because the data are noisy, high-dimensional and sparse. Moreover, the impact of multiple latent factors on gene expression heterogeneity, and on the ability to identify cell types accurately, remains unclear. Overcoming these challenges to reveal the true differences between cells has become the key to scRNA-seq data analysis, and unsupervised discovery of cell populations from scRNA-seq data has therefore become an important research area. Accurate cell type identification depends on a good measure of cell similarity. Here, we present BioRank, a new cell similarity assessment method that uses annotated gene sets and gene ranks. To evaluate its performance, we cluster cells with two classical clustering algorithms, using the between-cell similarities obtained by BioRank. BioRank can be used with any clustering algorithm that requires a similarity matrix. Applied to twelve published scRNA-seq datasets, our method performs better than, or at least as well as, several popular similarity assessment methods and single-cell clustering methods.
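To make the idea of a rank-based, gene-set-level similarity concrete, the following is a minimal sketch in Python. It is not the published BioRank algorithm, whose details are not given in this abstract. It assumes one plausible reading: rank genes within each cell, summarize each annotated gene set by the mean rank of its members, and compare cells by the Pearson correlation of those summaries. The function names, the toy expression values and the gene sets are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1 = lowest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def gene_set_scores(expression, genes, gene_sets):
    """Summarize one cell as the mean within-cell gene rank of each gene set."""
    ranks = dict(zip(genes, average_ranks(expression)))
    return [mean(ranks[g] for g in members) for members in gene_sets.values()]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def similarity_matrix(cells, genes, gene_sets):
    """Cell-by-cell similarity computed on gene-set-level rank scores."""
    scores = [gene_set_scores(c, genes, gene_sets) for c in cells]
    return [[pearson(a, b) for b in scores] for a in scores]


# Hypothetical toy data: three cells, four genes, three annotated gene sets.
genes = ["g1", "g2", "g3", "g4"]
gene_sets = {"setA": ["g1", "g2"], "setB": ["g3", "g4"], "setC": ["g1", "g4"]}
cells = [
    [5, 3, 0, 0],   # cell 0: setA genes highly expressed
    [10, 6, 0, 1],  # cell 1: same pattern at a different scale
    [0, 0, 4, 7],   # cell 2: setB genes highly expressed
]

sim = similarity_matrix(cells, genes, gene_sets)
for row in sim:
    print([round(v, 3) for v in row])
```

Because ranks are computed within each cell, cells 0 and 1 come out as highly similar despite their different expression scales, while cell 2 is dissimilar to both. A matrix like `sim` can be passed to any clustering algorithm that accepts similarities, or converted to a dissimilarity such as `1 - sim[i][j]` for algorithms that expect distances.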
