Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity among cells, yielding insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and improving downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring knowledge of the true labels. The R-S plot provides a unique alternative to uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
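
The two-step idea described above (cluster similar genes, then project each gene cluster into one supergene per cell) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm or the form of the nonlinear correlation, so the toy k-means over gene profiles, the Gaussian kernel, the scale parameter `tau`, and the function names `partition_genes` and `project_supergenes` are all assumptions made for illustration.

```python
import numpy as np


def partition_genes(X, n_components, n_iter=20, seed=0):
    """Group genes (columns of X) into clusters with a toy k-means.

    Assumption: k-means is used here only as a stand-in for whatever
    gene clustering step the method actually applies.
    """
    rng = np.random.default_rng(seed)
    genes = np.asarray(X, dtype=float).T  # shape: (n_genes, n_cells)
    start = rng.choice(len(genes), size=n_components, replace=False)
    centers = genes[start].copy()
    labels = np.zeros(len(genes), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every gene profile to every center.
        d = ((genes[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for k in range(n_components):
            if np.any(labels == k):
                centers[k] = genes[labels == k].mean(axis=0)
    return labels


def project_supergenes(X, labels, n_components, tau=1.0):
    """Map each gene cluster to one supergene value per cell.

    For cluster k, the value for cell i is the sum over all other cells j
    of a Gaussian kernel of the distance between cells i and j, measured
    only on the genes in cluster k (assumed kernel form).
    """
    X = np.asarray(X, dtype=float)
    n_cells = X.shape[0]
    Z = np.zeros((n_cells, n_components))
    for k in range(n_components):
        sub = X[:, labels == k]  # cells restricted to cluster-k genes
        if sub.shape[1] == 0:
            continue  # empty cluster: leave the supergene at zero
        diff = sub[:, None, :] - sub[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))  # (n_cells, n_cells)
        positive = dist[dist > 0]
        scale = tau * positive.mean() if positive.size else 1.0
        corr = np.exp(-((dist / scale) ** 2))  # nonlinear correlation
        np.fill_diagonal(corr, 0.0)  # exclude self-correlation
        Z[:, k] = corr.sum(axis=1)  # accumulate over all cells
    return Z


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts = rng.poisson(1.0, size=(30, 12)).astype(float)  # cells x genes
    gene_labels = partition_genes(counts, n_components=3)
    supergenes = project_supergenes(counts, gene_labels, n_components=3)
    print(supergenes.shape)
```

In this sketch each cell ends up with `n_components` non-negative features, one per gene cluster, which could then be passed to any clustering or classification routine in place of principal components.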
