Fisher-Pitman permutation tests based on nonparametric Poisson mixtures with application to single cell genomics

This paper investigates the theoretical and empirical performance of Fisher-Pitman-type permutation tests for assessing the equality of unknown Poisson mixture distributions. Built on nonparametric maximum likelihood estimators (NPMLEs) of the mixing distribution, these tests are shown theoretically to adapt to complicated, unspecified structures in count data and to be consistent against the corresponding ANOVA-type alternatives; the latter result parallels the classical claims of Robinson (1973). The methods are then applied to single-cell RNA-seq data obtained from different cell types in brain samples of autism subjects and healthy controls; empirically, they reveal genes that are differentially expressed between autism and control subjects yet are missed by common tests. To justify the use of NPMLEs, their rate optimality is also established in settings similar to those of nonparametric Gaussian mixtures (Wu and Yang, 2020a) and binomial mixtures (Tian et al., 2017; Vinayak et al., 2019).
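To make the permutation idea concrete, the following is a minimal sketch of a generic Fisher-Pitman permutation test for scalar observations. It uses the classical one-way ANOVA F statistic, not the NPMLE-based statistic studied in the paper, and the function names `f_statistic` and `fisher_pitman_test`, the number of permutations, and the seed are illustrative assumptions.

```python
import numpy as np


def f_statistic(values, labels):
    """One-way ANOVA F statistic for scalar observations grouped by labels."""
    groups = [values[labels == g] for g in np.unique(labels)]
    n, k = len(values), len(groups)
    grand_mean = values.mean()
    # Between-group and within-group sums of squares.
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (between / (k - 1)) / (within / (n - k))


def fisher_pitman_test(values, labels, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for the F statistic.

    Assumption: the classical F statistic stands in for the paper's
    NPMLE-based statistic, which is not specified in this abstract.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    observed = f_statistic(values, labels)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle group labels; under the null, labels are exchangeable.
        permuted = rng.permutation(labels)
        if f_statistic(values, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

In this sketch, `values` could hold any scalar summary per unit (for example a count per cell) and `labels` the group membership; the p-value is the proportion of label permutations whose statistic is at least as large as the observed one.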

[1]  Katarzyna Chawarska, et al.  Early generalized overgrowth in boys with autism , 2011, Archives of general psychiatry.

[2]  Asymptotic Properties of Maximum Likelihood Estimates in the Mixed Poisson Model , 1984 .

[3]  K. Roeder,et al.  Uniqueness of estimation and identifiability in mixture models , 1993 .

[4]  N. Laird Nonparametric Maximum Likelihood Estimation of a Mixing Distribution , 1978 .

[5]  Dankmar Böhning,et al.  Numerical estimation of a probability measure , 1985 .

[6]  B. Lindsay Mixture models : theory, geometry, and applications , 1995 .

[7]  Matthew Stephens,et al.  Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis , 2020, Nature Genetics.

[8]  Cun-Hui Zhang On Estimating Mixing Densities in Discrete Exponential Family Models , 1995 .

[9]  Dankmar Böhning Convergence of Simar's Algorithm for Finding the Maximum Likelihood Estimate of a Compound Poisson Process , 1982 .

[10]  R. Gottardo,et al.  Individual Level Differential Expression Analysis for Single Cell RNA-seq data , 2021, bioRxiv.

[11]  W. Hoeffding The Large-Sample Power of Tests Based on Permutations of Observations , 1952 .

[12]  Pengkun Yang,et al.  Polynomial Methods in Statistical Inference: Theory and Practice , 2020, Found. Trends Commun. Inf. Theory.

[13]  Yoav Zemel,et al.  Statistical Aspects of Wasserstein Distances , 2018, Annual Review of Statistics and Its Application.

[14]  O. Okoye,et al.  Refractive errors in children with autism in a developing country. , 2014, Nigerian journal of clinical practice.

[15]  J. Dubé,et al.  Implication of hypocholesterolemia in autism spectrum disorder and its associated comorbidities: A retrospective case–control study , 2019, Autism research : official journal of the International Society for Autism Research.

[16]  E. Pitman SIGNIFICANCE TESTS WHICH MAY BE APPLIED TO SAMPLES FROM ANY POPULATIONS III. THE ANALYSIS OF VARIANCE TEST , 1938 .

[17]  C. Calarge,et al.  Bone Mass in Boys with Autism Spectrum Disorder , 2017, Journal of autism and developmental disorders.

[18]  A. Nardi,et al.  Autism spectrum disorders: let’s talk about glucose? , 2019, Translational Psychiatry.

[19]  H. Muller,et al.  Fréchet regression for random objects with Euclidean predictors , 2016, The Annals of Statistics.

[20]  Nicolas W. Hengartner,et al.  Adaptive demixing in Poisson mixture models , 1997 .

[21]  Jun Cai,et al.  Mixtures of Exponential Distributions , 2006 .

[22]  R. J. Serfling Approximation Theorems of Mathematical Statistics , 1980 .

[23]  Geng Chen,et al.  Single-Cell RNA-Seq Technologies and Related Computational Data Analysis , 2019, Front. Genet..

[24]  Paul D. McNicholas,et al.  A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data , 2017, BMC Bioinformatics.

[25]  Yanjun Han,et al.  The Optimality of Profile Maximum Likelihood in Estimating Sorted Discrete Distributions , 2020, ArXiv.

[26]  G. Chapuy Random permutations and their discrepancy process , 2007 .

[27]  F. Roueff,et al.  Nonparametric estimation of the mixing density using polynomials , 2010, 1002.4516.

[28]  Changbao Wu,et al.  Some Algorithmic Aspects of the Theory of Optimal Designs , 1978 .

[29]  M. Kendall Statistical Methods for Research Workers , 1937, Nature.

[30]  Nonparametric estimation of mixing densities for discrete distributions , 2005, math/0602217.

[31]  Yanjun Han,et al.  Minimax estimation of the L1 distance , 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[32]  Ramana V. Davuluri,et al.  NPEBseq: nonparametric empirical bayesian-based procedure for differential expression analysis of RNA-seq data , 2013, BMC Bioinformatics.

[33]  Yihong Wu,et al.  Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models , 2020, 2008.08244.

[34]  Brett Baisch,et al.  Reaction Time of Children with and without Autistic Spectrum Disorders , 2017 .

[35]  Gilles Celeux,et al.  Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models , 2015, Bioinform..

[36]  Yuan Jiang,et al.  Modelling RNA‐Seq data with a zero‐inflated mixture Poisson linear model , 2019, Genetic epidemiology.

[37]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[38]  Marti J. Anderson,et al.  A new method for non-parametric multivariate analysis of variance in ecology , 2001 .

[39]  Sara A. van de Geer,et al.  Asymptotic theory for maximum likelihood in nonparametric mixture models , 2003, Comput. Stat. Data Anal..

[40]  S. Scherer,et al.  Variability of Creatine Metabolism Genes in Children with Autism Spectrum Disorder , 2017, International journal of molecular sciences.

[41]  Kenji Mori,et al.  Head circumference and body growth in autism spectrum disorders , 2011, Brain and Development.

[42]  Yu Zhu,et al.  PM-Seq: Using Finite Poisson Mixture Models for RNA-Seq Data Analysis and Transcript Expression Level Quantification , 2013 .

[43]  J. Hess,et al.  Analysis of variance , 2018, Transfusion.

[44]  Jingshu Wang,et al.  Gene expression recovery for single cell RNA sequencing , 2017, bioRxiv.

[45]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[46]  A. P. White,et al.  The approximate randomization test as an alternative to the F test in analysis of variance , 1981 .

[47]  Krishna R. Kalari,et al.  Beta-Poisson model for single-cell RNA-seq data analyses , 2016, Bioinform..

[48]  Tsachy Weissman,et al.  Concentration Inequalities for the Empirical Distribution , 2018, Information and Inference: A Journal of the IMA.

[49]  B. Lindsay The Geometry of Mixture Likelihoods: A General Theory , 1983 .

[50]  Cun-Hui Zhang,et al.  Rate of divergence of the nonparametric likelihood ratio test for Gaussian mixtures , 2019, Bernoulli.

[51]  L. Simar Maximum Likelihood Estimation of a Compound Poisson Process , 1976 .

[52]  GLOBAL PROPERTIES OF KERNEL ESTIMATORS FOR MIXING DENSITIES IN DISCRETE EXPONENTIAL FAMILY MODELS , 1996 .

[53]  A. Timan Theory of Approximation of Functions of a Real Variable , 1994 .

[54]  Maximilian Haeussler,et al.  Single-cell genomics identifies cell type–specific molecular changes in autism , 2019, Science.

[55]  Martin Jinye Zhang,et al.  Determining sequencing depth in a single-cell RNA-seq experiment , 2020, Nature Communications.

[56]  Yihong Wu,et al.  Optimal estimation of Gaussian mixtures via denoised method of moments , 2018, The Annals of Statistics.

[57]  J. I The Design of Experiments , 1936, Nature.

[58]  Stephannie L. Furtak,et al.  Examining the comorbidity of bipolar disorder and autism spectrum disorders: a large controlled analysis of phenotypic and familial correlates in a referred population of youth with bipolar I disorder with and without autism spectrum disorders. , 2013, The Journal of clinical psychiatry.

[59]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[60]  Shin Ta Liu,et al.  Permutation Methods: A Distance Function Approach , 2002, Technometrics.

[61]  Siamak Zamani Dadaneh,et al.  BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data , 2016, 1608.03991.

[62]  Robert J. Boik,et al.  The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous , 1987 .

[63]  Yanjun Han,et al.  Minimax Estimation of Functionals of Discrete Distributions , 2014, IEEE Transactions on Information Theory.

[64]  Bodhisattva Sen,et al.  Multivariate Rank-Based Distribution-Free Nonparametric Testing Using Measure Transportation , 2019, Journal of the American Statistical Association.

[65]  J. Kalbfleisch,et al.  An Algorithm for Computing the Nonparametric MLE of a Mixing Distribution , 1992 .

[66]  B. Lindsay The Geometry of Mixture Likelihoods, Part II: The Exponential Family , 1983 .

[67]  Jiahua Chen Consistency of the MLE under mixture models , 2016, 1607.01251.

[68]  J. Robinson,et al.  The Large-Sample Power of Permutation Tests for Randomization Models , 1973 .

[69]  Changbao Wu,et al.  Some iterative procedures for generating nonsingular optimal designs , 1978 .

[70]  Sara van de Geer,et al.  Rates of convergence for the maximum likelihood estimator in mixture models , 1996 .

[71]  P. Mielke,et al.  Moment approximations as an alternative to the F test in analysis of variance , 1983 .

[72]  Joseph P. Romano,et al.  EXACT AND ASYMPTOTICALLY ROBUST PERMUTATION TESTS , 2013, 1304.5939.

[73]  Sham M. Kakade,et al.  Maximum Likelihood Estimation for Learning Populations of Parameters , 2019, ICML.

[74]  L. A. Marascuilo,et al.  Nonparametric and Distribution-Free Methods for the Social Sciences , 1977 .

[75]  M. Drton,et al.  Distribution-Free Consistent Independence Tests via Center-Outward Ranks and Signs , 2019, Journal of the American Statistical Association.

[76]  J. Pfanzagl,et al.  Consistency of maximum likelihood estimators for certain nonparametric families, in particular: mixtures , 1988 .

[77]  E. V. van Someren,et al.  Insomnia Severity in Adults with Autism Spectrum Disorder is Associated with sensory Hyper-Reactivity and Social Skill Impairment , 2019, Journal of Autism and Developmental Disorders.

[78]  Yihong Wu,et al.  Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation , 2014, IEEE Transactions on Information Theory.

[79]  Gregory Valiant,et al.  Learning Populations of Parameters , 2017, NIPS.

[80]  X. Nguyen Convergence of latent mixing measures in finite and infinite mixture models , 2011, 1109.3250.

[81]  Paul W. Mielke,et al.  Meteorological applications of permutation techniques based on distance functions , 1984, Nonparametric Methods.

[82]  Kenneth J. Berry,et al.  Multi-response permutation procedures for a priori classifications , 1976 .

[83]  Dankmar Böhning,et al.  A vertex-exchange-method in D-optimal design theory , 1986 .

[84]  R. Bass Convergence of probability measures , 2011 .

[85]  J. Kiefer,et al.  CONSISTENCY OF THE MAXIMUM LIKELIHOOD ESTIMATOR IN THE PRESENCE OF INFINITELY MANY INCIDENTAL PARAMETERS , 1956 .