Artificial-cell-type aware cell-type classification in CITE-seq

Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, bringing accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types and complicate the automation of cell surface phenotyping. We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq that is robust to multiplet-induced artificial cell types. We benchmarked CITE-sort on real and simulated CITE-seq datasets and compared it against canonical clustering methods. CITE-sort produces the best clustering performance across these benchmarks. It not only accurately identifies real biological cell types but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real biological-cell-type droplet clusters. In addition, CITE-sort organizes its clustering process as a binary tree, which makes its clustering result easy to interpret and verify and simplifies cell type annotation with domain knowledge in CITE-seq.
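To make the multiplet problem concrete, the following minimal sketch shows how a droplet that captures two cells can look like a cell type that does not exist biologically. It is not part of CITE-sort: the marker values, the additive-count model, the fixed threshold, and the helper names `merge_droplet` and `phenotype` are all hypothetical and chosen only for illustration.

```python
# Hypothetical surface-marker counts for two real cell types.
# The numbers are illustrative only, not measured data.
T_CELL = {"CD3": 120, "CD19": 2}
B_CELL = {"CD3": 3, "CD19": 95}


def merge_droplet(*cells):
    """Sum the marker counts of all cells captured in one droplet.

    Assumption: counts combine additively, which is a simplification.
    """
    markers = sorted({m for cell in cells for m in cell})
    return {m: sum(cell.get(m, 0) for cell in cells) for m in markers}


def phenotype(profile, threshold=20):
    """Label each marker '+' or '-' using a fixed, hypothetical threshold."""
    return "".join(
        f"{m}{'+' if profile[m] >= threshold else '-'}" for m in sorted(profile)
    )


doublet = merge_droplet(T_CELL, B_CELL)

print(phenotype(T_CELL))   # real T cell phenotype
print(phenotype(B_CELL))   # real B cell phenotype
print(phenotype(doublet))  # artificial phenotype created by the doublet
```

Under these assumptions the doublet is positive for both markers, so a clustering method that is unaware of multiplets would report it as a distinct double-positive cell type alongside the two real ones.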
