Uncertainty Quantification in Multivariate Mixed Models for Mass Cytometry Data

Mass cytometry technology enables the simultaneous measurement of over 40 proteins on single cells. This has helped immunologists better understand the heterogeneity, complexity, and lineage relationships of white blood cells. Current statistical methods often collapse the rich single-cell data into summary statistics before proceeding with downstream analysis, discarding much of the information in these multivariate datasets. In this article, we aim to show how statistical analyses can be applied to the raw, uncompressed data, thereby improving replicability and exposing multivariate patterns and their associated uncertainty profiles. We show that multivariate generative models are a valid alternative to univariate hypothesis testing. We propose two models: a multivariate Poisson log-normal mixed model and a logistic linear mixed model. We show that these models are complementary and that either model can account for different confounders. We use Hamiltonian Monte Carlo to provide Bayesian uncertainty quantification. Applied to a recent pregnancy study, our models successfully reproduce key findings while quantifying an overall increase in protein-to-protein correlations between the first and third trimesters.
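To make the generative idea behind a Poisson log-normal model concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes two hypothetical proteins, a single donor-level shift standing in for a random effect, and illustrative values for the log-scale means, standard deviations, and latent correlation `rho`; none of these values come from the article. Counts are drawn as Poisson variates whose log-rates are correlated Gaussians, so the correlation is carried by the latent layer.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (adequate for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_cells(n_cells, donor_effect, rng, rho=0.6):
    """Simulate counts for two hypothetical proteins on n_cells cells from one donor."""
    baseline = (1.0, 1.5)  # assumed log-scale means, chosen only for illustration
    sd = (0.5, 0.5)        # assumed log-scale standard deviations
    cells = []
    for _ in range(n_cells):
        # Correlated Gaussian noise on the log scale (2x2 Cholesky factor written out).
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e1 = sd[0] * z1
        e2 = sd[1] * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        # The donor-level shift is shared by every cell from the same donor.
        rate1 = math.exp(baseline[0] + donor_effect + e1)
        rate2 = math.exp(baseline[1] + donor_effect + e2)
        cells.append((sample_poisson(rate1, rng), sample_poisson(rate2, rng)))
    return cells


def pearson(pairs):
    """Sample Pearson correlation between the two coordinates of a list of pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(0)
    cells = simulate_cells(2000, donor_effect=0.2, rng=rng)
    print("count-level correlation:", round(pearson(cells), 3))
```

In this sketch the correlation observed between counts is weaker than the latent `rho`, because Poisson sampling adds independent noise to each protein. Fitting such a model to real data, as the article does with Hamiltonian Monte Carlo, would estimate the latent correlation together with its uncertainty rather than fixing it in advance.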
