Evaluation of methods to assign cell type labels to cell clusters from single-cell RNAsequencing data

Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated computational steps like data normalization, dimensionality reduction and cell clustering. However, assigning cell type labels to cell clusters is still conducted manually by most researchers, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. Two bottlenecks to automating this task are the scarcity of reference cell type gene expression signatures and that some dedicated methods are available only as web servers with limited cell type gene expression signatures. In this study, we benchmarked four methods (CIBERSORT, GSEA, GSVA, and ORA) for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used scRNA-seq datasets from liver, peripheral blood mononuclear cells and retinal neurons for which reference cell type gene expression signatures were available. Our results show that, in general, all four methods show a high performance in the task as evaluated by Receiver Operating Characteristic curve analysis (average AUC = 0.94, sd = 0.036), whereas Precision-Recall curve analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). CIBERSORT and GSVA were the top two performers. Additionally, GSVA was the fastest of the four methods and was more robust in cell type gene expression signature subsampling simulations. We provide an extensible framework to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.

[1]  Ash A. Alizadeh,et al.  Robust enumeration of cell subsets from tissue expression profiles , 2015, Nature Methods.

[2]  Ziv Bar-Joseph,et al.  A web server for comparative analysis of single-cell RNA-seq data , 2018, Nature Communications.

[3]  Gary D Bader,et al.  Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations , 2018, Nature Communications.

[4]  Michael J. T. Stubbington,et al.  The Human Cell Atlas: from vision to reality , 2017, Nature.

[5]  Luyi Tian,et al.  Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data , 2018, F1000Research.

[6]  Brian D. Aevermann,et al.  Cell type discovery and representation in the era of high-content single cell phenotyping , 2017, BMC Bioinformatics.

[7]  M. Robinson,et al.  A systematic performance evaluation of clustering methods for single-cell RNA-seq data. , 2018, F1000Research.

[8]  M. Ashburner,et al.  An ontology for cell types , 2005, Genome Biology.

[9]  Sara Ballouz,et al.  Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor , 2018, Nature Communications.

[10]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[11]  Paul Hoffman,et al.  Integrating single-cell transcriptomic data across different conditions, technologies, and species , 2018, Nature Biotechnology.

[12]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Grace X. Y. Zheng,et al.  Massively parallel digital transcriptional profiling of single cells , 2016, Nature Communications.

[14]  Evan Z. Macosko,et al.  Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptomics , 2016, Cell.

[15]  Justin Guinney,et al.  GSVA: gene set variation analysis for microarray and RNA-Seq data , 2013, BMC Bioinformatics.

[16]  Hanlee P. Ji,et al.  scPred: Cell type prediction at single-cell resolution , 2018, bioRxiv.

[17]  R. Fisher,et al.  The Logic of Inductive Inference , 1935 .

[18]  Gary D Bader,et al.  scClustViz – Single-cell RNAseq cluster assessment and visualization , 2018, F1000Research.

[19]  Peter Bühlmann,et al.  Analyzing gene expression data in terms of gene sets: methodological issues , 2007, Bioinform..