A systematic performance evaluation of clustering methods for single-cell RNA-seq data.

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq (scRNA-seq) data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degrees of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the performance of the clustering algorithms themselves. We evaluated the ability of the methods to recover known subpopulations, as well as their stability, run time and scalability. Additionally, we investigated whether performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in performance, run time and stability among the methods, with SC3 and Seurat showing the most favorable results. We also found that consensus clustering typically did not improve performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).
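The abstract does not name the metric used to score how well a partition recovers known subpopulations. As an illustration only, the sketch below assumes the adjusted Rand index, a common choice for comparing a clustering against reference labels. It is written in Python for brevity, although the evaluated methods are implemented in R; the function name `adjusted_rand_index` and the toy labels are hypothetical and not taken from the paper's code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same cells.

    Assumption: this metric is chosen for illustration; the abstract
    does not state which agreement measure was used.
    """
    if len(truth) != len(predicted):
        raise ValueError("both labelings must cover the same cells")

    n = len(truth)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two cells: the partitions agree trivially.
        return 1.0

    # Contingency counts: cells sharing a (true label, predicted label) pair.
    cell_counts = Counter(zip(truth, predicted))
    truth_counts = Counter(truth)
    predicted_counts = Counter(predicted)

    # Pairs of cells placed together in both partitions.
    index = sum(comb(c, 2) for c in cell_counts.values())
    # Pairs placed together within each partition separately.
    truth_pairs = sum(comb(c, 2) for c in truth_counts.values())
    predicted_pairs = sum(comb(c, 2) for c in predicted_counts.values())

    expected = truth_pairs * predicted_pairs / total_pairs
    max_index = (truth_pairs + predicted_pairs) / 2

    if max_index == expected:
        # Degenerate case, e.g. both partitions put all cells in one cluster.
        return 1.0
    return (index - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy labels (hypothetical): three known subpopulations, one cell misassigned.
    known = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
    clustered = [1, 1, 2, 2, 2, 2, 3, 3, 3]
    print(adjusted_rand_index(known, clustered))
```

The index is invariant to how clusters are labeled, which matters here because an unsupervised method has no way of reproducing the names of the reference subpopulations.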
