Algorithmic Clustering Of Single‐Cell Cytometry Data—How Unsupervised Are These Analyses Really?

ACROSS single-cell technologies, including flow and mass cytometry as well as scRNA-seq, unsupervised clustering algorithms have become a staple of data analysis and are often hailed as a replacement for manual gating, with the promise of an unbiased interrogation of the data. There is no shortage of software for the purpose, and many tools come with user-friendly graphical interfaces for the less programming-inclined part of the community. The algorithms boast a wide range of features: some excel at detecting rare cell populations, some suggest the number of distinct cell subsets in the data, some are fast, some are highly reproducible, and so on. Common to almost all of them is that they are oversold on at least one aspect: they almost never provide an unsupervised, unbiased answer at the click of a button, but rather prompt a semisupervised, iterative, interdisciplinary process of computational analysis (e.g., by a bioinformatician) and domain-expert interpretation (e.g., by an immunologist, hematologist, or disease specialist) until a biologically meaningful clustering is achieved (1) (Fig. 1). This is not to say that they are not useful—they certainly are—but the one-click, one-size-fits-all analysis of single-cell data remains elusive.
Heavy development of algorithms and tools has been followed by extensive testing and reviewing. In a key review of cytometry clustering tools, Weber and Robinson (2016) (2) highlighted a number of algorithms performing well on parameters such as the ability to detect rare or even novel cell populations, the ability to produce results mirroring those achieved by manual gating of the data, the reproducibility of the results from run to run, and the run times of the algorithms. The FlowSOM algorithm (3) came out on top in terms of speed, which, combined with good clustering reproducibility, has made it a go-to algorithm in studies involving both flow and mass cytometry. The advantages of FlowSOM and similar unsupervised clustering approaches over traditional manual gating have been discussed extensively (1,2,4), with the key conclusion being that algorithmic clustering is not only more convenient than manual gating but, being unbiased by biological preconceptions, also offers the potential to detect rare populations likely to be missed by manual approaches.
There are, however, a number of features of automated clustering that users need to be aware of. Firstly, mathematically optimal clustering is not the same as biologically meaningful clustering. The unsupervised algorithms remain ignorant of decades of biological research, as well as of the technical uncertainty of the data as produced by the various technologies (5,6). We may know for a fact that two markers are never expressed simultaneously on the same cell lineage, but if the expression of all other markers happens to be similar, the algorithm will be none the wiser and will likely combine the two cell types into a single cluster. This property can be argued to be both a feature and a bug at the same time—unbiased, naive data analysis is more likely to reveal rare or novel cell populations, but given the highly knowledge-based approach to constructing the phenotyping panels in these studies, how unbiased can we really expect the analysis results to be?
Secondly, when evaluating the accuracy of clustering algorithms, we face the problem that we lack an objective benchmark—when attempting to expand the horizon of our current knowledge, the truth of course becomes a subjective matter, and even when simply attempting to replicate basic existing knowledge, the benchmark is usually a manually gated population, subject to the biases and variability of the human analyst.

[1] B. Becher et al. CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets, 2017, F1000Research.

[2] Rossella Melchiotti et al. Cluster stability in the analysis of mass cytometry data, 2017, Cytometry Part A.

[3] Sean C. Bendall et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis, 2015, Cell.

[4] Sean C. Bendall et al. Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse, 2018, Nature Medicine.

[5] Bjoern Peters et al. DAFi: A directed recursive data filtering and clustering approach for improving and interpreting data clustering identification of cell populations from polychromatic flow cytometry data, 2018, Cytometry Part A.

[6] Getting the Most from Your High-Dimensional Cytometry Data, 2019, Immunity.

[7] Mark D. Robinson et al. Comparison of Clustering Methods for High-Dimensional Single-Cell Flow and Mass Cytometry Data, 2016, bioRxiv.

[8] Lai Guan Ng et al. Dimensionality reduction for visualizing single-cell data using UMAP, 2018, Nature Biotechnology.

[9] G. Nolan et al. Automated Mapping of Phenotype Space with Single-Cell Data, 2016, Nature Methods.

[10] Piet Demeester et al. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data, 2015, Cytometry Part A.

[11] Lars R. Olsen et al. The anatomy of single cell mass cytometry data, 2019, Cytometry Part A.

[12] Peng Qiu et al. Toward deterministic and semiautomated SPADE analysis, 2017, Cytometry Part A.

[13] Susan Holmes et al. Uncertainty Quantification in Multivariate Mixed Models for Mass Cytometry Data, 2019, arXiv:1903.07976.