scds: computational annotation of doublets in single-cell RNA sequencing data

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Allon M Klein,et al.  Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. , 2019, Cell systems.

[2]  Allon M. Klein,et al.  Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells , 2015, Cell.

[3]  L. J. K. Wee,et al.  Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors , 2017, Nature Genetics.

[4]  J. Schug,et al.  Single-Cell Transcriptomics of the Human Endocrine Pancreas , 2016, Diabetes.

[5]  Sarah A. Teichmann,et al.  Single-cell analysis of CD4+ T-cell differentiation reveals three major cell states and progressive acceleration of proliferation , 2016, Genome Biology.

[6]  Charu C. Aggarwal,et al.  Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2016, KDD.

[7]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[8]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[9]  Alexandra Maslova,et al.  Single-Cell Transcriptome Profiling of Mouse and hESC-Derived Pancreatic Progenitors , 2018, bioRxiv.

[10]  Christopher S. McGinnis,et al.  DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors , 2018, bioRxiv.

[11]  Sandrine Dudoit,et al.  Normalizing single-cell RNA sequencing data: challenges and opportunities , 2017, Nature Methods.

[12]  Xavier Robin,et al.  pROC: an open-source package for R and S+ to analyze and compare ROC curves , 2011, BMC Bioinformatics.

[13]  David J. Jörg,et al.  Defining murine organogenesis at single cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation , 2018, Nature Cell Biology.

[14]  M. Hemberg,et al.  Challenges in unsupervised clustering of single-cell RNA-seq data , 2019, Nature Reviews Genetics.

[15]  Chun Jimmie Ye,et al.  Multiplexed droplet single-cell RNA-sequencing using natural genetic variation , 2017, Nature Biotechnology.

[16]  S. Potter,et al.  Single-cell RNA sequencing for the study of development, physiology and disease , 2018, Nature Reviews Nephrology.

[17]  J. Keilwagen,et al.  Area under Precision-Recall Curves for Weighted and Unweighted Data , 2014, PloS one.

[18]  Tijl De Bie,et al.  Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations , 2016, KDD.

[19]  Salah Ayoub,et al.  Cell fixation and preservation for droplet-based single-cell transcriptomics , 2017, BMC Biology.

[20]  E. Birney,et al.  Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt , 2009, Nature Protocols.

[21]  Richard A. Muscat,et al.  Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding , 2018, Science.

[22]  Alexander Lex,et al.  UpSetR: an R package for the visualization of intersecting sets and their properties , 2017, bioRxiv.

[23]  Ting Gong,et al.  DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data , 2013, Bioinform..

[24]  S. Teichmann,et al.  Computational and analytical challenges in single-cell transcriptomics , 2015, Nature Reviews Genetics.

[25]  J. Marioni,et al.  Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing , 2017, Nature Communications.

[26]  Mark Danielsen,et al.  An Introduction to the Analysis of Single-Cell RNA-Sequencing Data , 2018, Molecular therapy. Methods & clinical development.

[27]  Bertrand Z. Yeung,et al.  Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics , 2018, Genome Biology.

[28]  Jennifer L Hu,et al.  MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices , 2019, Nature Methods.

[29]  Sean C. Bendall,et al.  Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis , 2015, Cell.

[30]  I. Hellmann,et al.  Comparative Analysis of Single-Cell RNA Sequencing Methods , 2016, bioRxiv.

[31]  Davis J. McCarthy,et al.  A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor , 2016, F1000Research.

[32]  D. M. Smith,et al.  Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes , 2016, Cell metabolism.

[33]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[34]  Paul Hoffman,et al.  Integrating single-cell transcriptomic data across different conditions, technologies, and species , 2018, Nature Biotechnology.