Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Single-cell RNA sequencing (scRNA-seq) has developed rapidly and is widely applied in biological and medical research. Identifying cell types in scRNA-seq data sets is an essential step before in-depth investigation of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes does not scale to the increasingly large number of scRNA-seq data sets because it relies on complicated procedures and manual annotation. Therefore, a number of tools have recently been developed to predict cell types in new data sets using reference data sets. These methods have not been widely adopted because of a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. The results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than the others. Simple ensemble voting across all tools can improve predictive accuracy. Under nonideal conditions, such as small or class-imbalanced reference data sets, tools based on cluster-level similarities perform better. However, even with the ability to assign 'unassigned' labels, it remains challenging to detect novel cell types using any single-cell classifier alone. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds light on potential directions for future improvement of classification tools.
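The ensemble voting mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the voting scheme, so this assumes a plain per-cell majority vote over the labels returned by each tool. The function name `majority_vote`, the tool names, the example labels, and the choice to mark ties as 'unassigned' are all hypothetical and are not taken from the article.

```python
from collections import Counter


def majority_vote(predictions, tie_label="unassigned"):
    """Combine per-cell labels from several classifiers by majority vote.

    predictions: dict mapping a tool name to a list of predicted labels,
    one label per cell, with cells in the same order for every tool.
    Assumption: a tie for the most frequent label yields `tie_label`.
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all tools must label the same number of cells")

    consensus = []
    # zip(*...) groups the labels that each tool gave to the same cell.
    for cell_labels in zip(*predictions.values()):
        counts = Counter(cell_labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            consensus.append(tie_label)  # no single label wins
        else:
            consensus.append(counts[0][0])
    return consensus


# Hypothetical predictions from three tools for three cells.
example = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "B cell", "B cell"],
    "tool_c": ["T cell", "monocyte", "T cell"],
}
print(majority_vote(example))  # ['T cell', 'B cell', 'unassigned']
```

Treating ties as 'unassigned' is only one possible design choice; a weighted vote based on each tool's benchmark accuracy would be another.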
