Score Matching for Compositional Distributions

arXiv:2012.12461v1 [stat.ME] 23 Dec 2020

Compositional data and multivariate count data with known totals are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. Often, many of the compositional components are highly right-skewed, with large numbers of zeros. A major limitation of currently available estimators for compositional models is that they either cannot handle many zeros in the data or are not computationally feasible in moderate to high dimensions. We derive a new set of score matching estimators applicable to distributions on a Riemannian manifold with boundary, of which the standard simplex is a special case. The score matching method is applied to estimate the parameters in a new flexible truncation model for compositional data, and we show that the estimators are scalable and available in closed form. Through extensive simulation studies, the score matching methodology is demonstrated to work well for estimating the parameters in the new truncation model and also for the Dirichlet distribution. We apply the new model and estimators to real microbiome compositional data and show that the model provides a good fit to the data.
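To illustrate why score matching can yield closed-form estimators, the sketch below applies the original Euclidean score matching objective (estimation of non-normalized models by score matching) to a one-dimensional Gaussian written as an exponential family. This is not the simplex estimator derived in the paper: the Gaussian model, the function name `score_matching_gaussian`, and the simulated sample are assumptions chosen only for illustration. For a model with log-density linear in the natural parameters, the empirical objective is quadratic in those parameters, so the minimiser solves a small linear system and no normalising constant is needed.

```python
import random

def score_matching_gaussian(xs):
    """Closed-form score matching fit of N(mu, sigma^2) on the real line.

    Illustrative assumption: log p(x) = eta1 * x + eta2 * x**2 - A(eta),
    so the score is eta1 + 2 * eta2 * x and its derivative is 2 * eta2.
    The empirical objective 0.5 * eta' W eta + d' eta is minimised at
    eta = -W^{-1} d.
    """
    n = len(xs)
    # W = average of g g' with g = d/dx (x, x^2) = (1, 2x)
    w11 = 1.0
    w12 = sum(2.0 * x for x in xs) / n
    w22 = sum(4.0 * x * x for x in xs) / n
    # d = second derivative of the sufficient statistics = (0, 2)
    d1, d2 = 0.0, 2.0
    det = w11 * w22 - w12 * w12
    # Solve the 2x2 system explicitly: eta = -W^{-1} d
    eta1 = -(w22 * d1 - w12 * d2) / det
    eta2 = -(-w12 * d1 + w11 * d2) / det
    # Map natural parameters back to mean and variance
    sigma2 = -1.0 / (2.0 * eta2)
    mu = eta1 * sigma2
    return mu, sigma2

# Simulated data (hypothetical values, for demonstration only)
random.seed(0)
sample = [random.gauss(1.5, 2.0) for _ in range(5000)]
mu_hat, sigma2_hat = score_matching_gaussian(sample)
print(mu_hat, sigma2_hat)
```

In this simple case the score matching estimates coincide with the sample mean and the (biased) sample variance. On a manifold with boundary such as the simplex, the objective has to be modified to account for the boundary, which is the setting the abstract describes.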
