Distribution-Free Multisample Test Based on Optimal Matching with Applications to Single Cell Genomics

We propose a nonparametric graph-based test, built on optimal matching, for assessing the equality of multiple unknown multivariate probability distributions. The procedure pools the data from the different classes, constructs a graph from the minimum non-bipartite matching of the pooled sample, and uses the number of edges connecting data points from different classes to measure the closeness of the distributions. The proposed test is exactly distribution-free (its null distribution does not depend on the distribution of the data) and can be applied efficiently to multivariate as well as non-Euclidean data, whenever the inter-point distances are well-defined. We show that the test is universally consistent and prove a distributional limit theorem for the test statistic under general alternatives. Through simulation studies, we demonstrate its superior performance relative to other well-known multisample tests. For scenarios in which the test indicates distributional differences across classes, we also propose an approach for identifying which class or group contributes to the overall difference. The method is applied to single-cell transcriptomics data obtained from the peripheral blood, cancer tissue, and tumor-adjacent normal tissue of human subjects with hepatocellular carcinoma and non-small-cell lung cancer. The analysis reveals patterns in how biochemical metabolic pathways are altered across immune cells in a cancer setting, depending on the tissue location. All of the methods described herein are implemented in the R package multicross.
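The pooling, matching, and edge-counting steps can be illustrated with a minimal Python sketch. It is not the multicross implementation: the function names `min_weight_perfect_matching` and `cross_counts` are hypothetical, Euclidean distance is assumed, the pooled sample size is assumed to be even, and the matching is found by an exhaustive bitmask search that is only practical for a handful of points (a real implementation would presumably use a blossom-type algorithm). The sketch stops at the raw cross-class edge counts; how the test combines them into a statistic and a p-value is not specified here.

```python
import math
from functools import lru_cache
from itertools import combinations


def min_weight_perfect_matching(points):
    """Exact minimum-distance perfect matching by bitmask search (tiny even n only)."""
    n = len(points)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of pooled points")
    # Assumption: Euclidean inter-point distances; any well-defined distance would do.
    dist = [[math.dist(p, q) for q in points] for p in points]

    @lru_cache(maxsize=None)
    def best(mask):
        # `mask` has bit k set when point k is still unmatched.
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest-index unmatched point
        rest = mask & ~(1 << i)
        best_cost, best_pairs = math.inf, ()
        candidates = rest
        while candidates:
            j = (candidates & -candidates).bit_length() - 1
            candidates &= candidates - 1  # drop j from the candidate set
            cost, pairs = best(rest & ~(1 << j))
            cost += dist[i][j]
            if cost < best_cost:
                best_cost, best_pairs = cost, ((i, j),) + pairs
        return best_cost, best_pairs

    return list(best((1 << n) - 1)[1])


def cross_counts(pairs, labels):
    """Count matched pairs that join each unordered pair of distinct classes."""
    classes = sorted(set(labels))
    counts = {pair: 0 for pair in combinations(classes, 2)}
    for i, j in pairs:
        if labels[i] != labels[j]:
            counts[tuple(sorted((labels[i], labels[j])))] += 1
    return counts


# Pooled sample: two classes interleaved on a line, so both matched pairs are cross-class.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = ["a", "b", "a", "b"]
pairs = min_weight_perfect_matching(points)
print(cross_counts(pairs, labels))  # {('a', 'b'): 2}
```

Many cross-class edges, as in this toy example, are what one expects when the classes share a distribution; when the classes are well separated, the matching pairs points mostly within classes and the cross counts shrink.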
