scSSA: A clustering method for single cell RNA-seq data based on semi-supervised autoencoder.

BACKGROUND Single-cell sequencing is a technology for high-throughput sequencing analysis of the genome, transcriptome and epigenome at the single-cell level. It addresses shortcomings of traditional bulk methods by revealing the gene structure and gene expression state of individual cells and by reflecting the heterogeneity between cells. Clustering is a very important step in the analysis of single-cell RNA-seq (scRNA-seq) data, but it faces two difficulties: dropout events and the curse of dimensionality. In addition, many current methods are driven by the data alone and do not make full use of existing biological information.

RESULTS In this work, we propose scSSA, a clustering model based on a semi-supervised autoencoder, fast independent component analysis (FastICA) and Gaussian mixture clustering. First, the semi-supervised autoencoder imputes and denoises the scRNA-seq data and produces a low-dimensional latent representation. Second, FastICA further reduces the dimension of this latent representation, and a Gaussian mixture model clusters the result. Finally, scSSA is compared with Seurat, CIDR and other methods on 10 public scRNA-seq datasets.

CONCLUSION The results show that scSSA achieves superior cell clustering performance on the 10 public datasets. scSSA can accurately identify cell types and is generally applicable to a wide range of single-cell datasets, so it has great application potential in the field of scRNA-seq data analysis. The code is available at https://github.com/houtongshuai123/scSSA/.
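The last two stages described above (FastICA on the latent representation, then Gaussian mixture clustering) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the semi-supervised autoencoder is omitted, and a synthetic matrix from `make_blobs` stands in for its latent output. The function name `cluster_latent`, the dimensions, the number of clusters, and the use of the adjusted Rand index as a check are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import FastICA
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def cluster_latent(latent, n_components=2, n_clusters=3, seed=0):
    """Reduce a latent matrix with FastICA, then cluster it with a GMM.

    Hypothetical helper: parameter names and defaults are illustrative only.
    """
    # Step 1: FastICA projects the latent features onto a few independent components.
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=1000)
    reduced = ica.fit_transform(latent)  # shape: (n_cells, n_components)

    # Step 2: a Gaussian mixture model assigns each cell to one cluster.
    gmm = GaussianMixture(n_components=n_clusters, n_init=5, random_state=seed)
    labels = gmm.fit_predict(reduced)
    return reduced, labels


# Synthetic stand-in for the autoencoder's latent representation
# (300 simulated "cells", 32 latent features, 3 simulated cell types).
latent, true_labels = make_blobs(
    n_samples=300, n_features=32, centers=3, cluster_std=1.0, random_state=0
)

reduced, labels = cluster_latent(latent)
ari = adjusted_rand_score(true_labels, labels)
print(f"reduced shape: {reduced.shape}, ARI vs. simulated labels: {ari:.2f}")
```

On real data, `latent` would be replaced by the encoder output of a trained autoencoder, and the number of components and clusters would need to be chosen per dataset.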
