scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder.

The rapid development of single-cell RNA sequencing (scRNA-Seq) technology provides strong technical support for the accurate and efficient analysis of single-cell gene expression data. However, the analysis of scRNA-Seq data faces many obstacles, including dropout events and the curse of dimensionality. Here, we propose scGMAI, a new single-cell Gaussian mixture clustering method based on autoencoder networks and fast independent component analysis (FastICA). Specifically, scGMAI uses autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and then uses FastICA to reduce the dimensionality of the reconstructed data. Integrating these computational techniques allows scGMAI to outperform existing tools, including Seurat, in clustering cells from 17 public scRNA-Seq datasets. In summary, scGMAI is an effective tool for accurately clustering cells and identifying cell types from scRNA-Seq data, and it shows strong potential for practical use in scRNA-Seq data analysis. The source code is available at https://github.com/QUST-AIBBDRC/scGMAI/.
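
The three stages described above (autoencoder reconstruction, FastICA dimensionality reduction, Gaussian mixture clustering) can be sketched with scikit-learn as follows. This is only an illustrative sketch, not the scGMAI implementation from the repository: the function name `cluster_cells`, the use of `MLPRegressor` trained to reproduce its own input as a stand-in autoencoder, the log-transform and scaling step, and all hyperparameter values are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def cluster_cells(counts, n_clusters, n_components=5, seed=0):
    """Hypothetical helper: reconstruct -> FastICA -> Gaussian mixture.

    counts: array of shape (n_cells, n_genes) with non-negative expression values.
    Returns one integer cluster label per cell.
    """
    # Assumed preprocessing: log-transform, then standardize each gene.
    x = StandardScaler().fit_transform(np.log1p(counts))

    # Stand-in autoencoder (assumption): an MLP with a narrow middle layer,
    # trained to reproduce its input, so its prediction is a smoothed
    # reconstruction of the expression matrix.
    autoencoder = MLPRegressor(
        hidden_layer_sizes=(64, 16, 64), max_iter=300, random_state=seed
    )
    autoencoder.fit(x, x)
    reconstructed = autoencoder.predict(x)

    # Reduce the reconstructed matrix to a few independent components.
    ica = FastICA(n_components=n_components, max_iter=1000, random_state=seed)
    reduced = ica.fit_transform(reconstructed)

    # Cluster the cells in the reduced space with a Gaussian mixture model.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    return gmm.fit_predict(reduced)


if __name__ == "__main__":
    # Synthetic data for demonstration only: three groups of cells with
    # different mean expression, plus randomly zeroed entries as "dropouts".
    rng = np.random.default_rng(0)
    means = rng.uniform(1.0, 20.0, size=(3, 50))
    counts = np.vstack([rng.poisson(m, size=(40, 50)) for m in means])
    counts = counts * (rng.random(counts.shape) > 0.2)
    labels = cluster_cells(counts, n_clusters=3)
    print(np.bincount(labels, minlength=3))
```

In practice the number of clusters and of independent components would have to be chosen per dataset; the values here are placeholders.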
