A Novel Workflow for Semi-supervised Annotation of Cell-type Clusters in Mass Cytometry Data

Mass cytometry by time-of-flight (CyTOF) is a widely used technology for studying variation in immune cell populations by simultaneously measuring the expression of 40-50 protein markers in millions of single cells. Traditionally, cell types are identified with a clustering method that uses cell-surface marker expression profiles to group similar cells. While instrumental in analyzing high-dimensional CyTOF datasets, current clustering-based strategies face a number of limitations. For instance, larger datasets are routinely sub-sampled (often only 10% or even less of all events are used), and the randomly selected cells are assumed to be representative of the entire cell population [1]. The primary reason for sub-sampling is to reduce computation time and memory use, but it causes significant data loss and reduces the probability of annotating non-canonical cell types with small population sizes. Moreover, the cluster to which a cell is assigned varies with its neighboring cells, which makes cell annotation difficult. However, the statistical recurrence of a given cell within a single cell-type cluster despite varying neighboring cells could be used to assign it to its statistically most probable cell type. Therefore, to extend the usability of existing approaches, we present a novel bootstrapping-based workflow, integrated with automated cell-type identification, that predicts statistically reproducible cell clusters. Briefly, the method first creates blocks of a fixed number of randomly selected cells from each sample; an expression sub-matrix is then created by picking one block at random from each sample and concatenating the selected blocks. The cells in the sub-matrix are then subjected to cell-type annotation using Linear Discriminant Analysis or the ACDC algorithm [2].
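The block construction described above can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: the function names `make_blocks` and `build_submatrix`, the block size of 20, the sample names, and the use of plain Python lists in place of a real expression matrix are all assumptions made for the example. Cells left over after the last full block are simply dropped here, which the original method may handle differently.

```python
import random


def make_blocks(n_cells, block_size, rng):
    """Shuffle cell indices and split them into blocks of block_size cells.

    Assumption: cells that do not fill a complete block are dropped.
    """
    order = list(range(n_cells))
    rng.shuffle(order)
    n_blocks = n_cells // block_size
    return [order[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def build_submatrix(samples, blocks, rng):
    """Pick one block per sample at random and concatenate the selected cells.

    Returns the stacked rows plus the (sample, cell index) origin of each row,
    so that annotations can later be traced back to individual cells.
    """
    rows, origin = [], []
    for name, expr in samples.items():
        block = rng.choice(blocks[name])
        for i in block:
            rows.append(expr[i])
            origin.append((name, i))
    return rows, origin


rng = random.Random(0)

# Hypothetical toy data: two samples, 5 markers per cell.
samples = {
    "sample_A": [[rng.random() for _ in range(5)] for _ in range(100)],
    "sample_B": [[rng.random() for _ in range(5)] for _ in range(80)],
}

blocks = {name: make_blocks(len(expr), 20, rng) for name, expr in samples.items()}
submatrix, origin = build_submatrix(samples, blocks, rng)
```

In this sketch each call to `build_submatrix` draws blocks independently, so it does not by itself guarantee that every iteration receives a unique sub-matrix; a real implementation would need to track which block combinations have already been used.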
These steps are repeated with a unique expression sub-matrix in each iteration, which provides a framework for testing the annotation of every cell to one or more cell types under varying neighboring cells. The statistical significance of a cell-type association is measured by how frequently the cell occurs in a given cell type across all iterations. Spurious and unstable cell-type clusters are identified by the variation in silhouette score, cluster size, and average Euclidean distance within each cell-type cluster across all iterations. Stable clusters are expected to produce meaningful and reproducible results, whereas unstable, dynamic cell-type clusters can be examined for unknown or rare cell types, or may represent batch-affected cells contaminated with technical noise. We benchmarked the accuracy of the workflow by classifying 22 hand-gated cell populations measured with 38 markers in replicate measurements of mass cytometry data from mice [3]. Preliminary results suggest ~85% accuracy in classifying the different cell subtypes across 500 iterations. We are currently improving the performance of this approach by integrating faster (GPU-based) clustering methods and benchmarking it on other public datasets containing non-canonical cell types.
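The frequency-based assignment across iterations can be illustrated with a short sketch. It assumes, purely for the example, that each iteration yields a dictionary mapping a cell identifier to its predicted cell type; the function name `consensus_labels`, the cell identifiers, and the labels are hypothetical, and the tie-breaking rule (the first-seen label wins) is an arbitrary choice rather than part of the described method.

```python
from collections import Counter, defaultdict


def consensus_labels(iteration_labels):
    """Assign each cell its most frequently predicted cell type.

    iteration_labels: list of dicts, one per iteration, mapping
    cell id -> predicted cell type for the cells drawn in that iteration.
    Returns cell id -> (most frequent cell type, fraction of that cell's
    appearances in which it received this type).
    """
    counts = defaultdict(Counter)
    for labels in iteration_labels:
        for cell, cell_type in labels.items():
            counts[cell][cell_type] += 1

    result = {}
    for cell, per_type in counts.items():
        best_type, n_best = per_type.most_common(1)[0]
        result[cell] = (best_type, n_best / sum(per_type.values()))
    return result


# Hypothetical predictions from three iterations; a cell appears only in
# the iterations whose sub-matrix happened to contain it.
runs = [
    {"c1": "B cell", "c2": "T cell"},
    {"c1": "B cell", "c3": "NK cell"},
    {"c1": "T cell", "c2": "T cell"},
]

consensus = consensus_labels(runs)
```

In this toy input, cell `c1` is labelled "B cell" in two of its three appearances, so it is assigned "B cell" with a frequency of 2/3, while `c2` and `c3` receive the same label every time they appear. A low frequency flags a cell whose annotation depends on which neighboring cells were drawn with it.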