Schoenberg-Rao distances: Entropy-based and geometry-aware statistical Hilbert distances

Distances between probability distributions that take into account the geometry of their sample space, such as the Wasserstein and Maximum Mean Discrepancy (MMD) distances, have received considerable attention in machine learning, as they can, for instance, compare probability distributions with disjoint supports. In this paper, we study a class of statistical Hilbert distances that we term the Schoenberg-Rao distances, a generalization of the MMD that accommodates a broader class of kernels, namely the conditionally negative semi-definite kernels. In particular, we introduce a principled way to construct such kernels and derive novel closed-form distances between mixtures of Gaussian distributions. These distances, derived from the concave Rao's quadratic entropy, enjoy favorable theoretical properties and possess interpretable hyperparameters that can be tuned for specific applications. Our method constitutes a practical alternative to Wasserstein distances, and we illustrate its efficiency on a broad range of machine learning tasks, including density estimation, generative modeling, and mixture simplification.
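
The abstract does not spell out the construction, so the following is only a minimal numerical sketch of the general idea, assuming the distance is (proportional to) the Jensen gap of Rao's quadratic entropy at the mixture of the two distributions, computed with a conditionally negative semi-definite kernel such as ||x - y||^alpha with 0 < alpha <= 2. All function names and the sample-based estimator below are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def cnd_kernel(x, y, alpha=1.0):
        # ||x - y||^alpha is conditionally negative semi-definite for 0 < alpha <= 2
        # (alpha = 1 recovers the energy-distance kernel); alpha is an assumed,
        # interpretable hyperparameter in the spirit of the abstract.
        return np.linalg.norm(np.asarray(x) - np.asarray(y)) ** alpha

    def rao_quadratic_entropy(samples, kernel=cnd_kernel):
        # Plug-in estimate of Rao's quadratic entropy H(p) = 1/2 E_{X,X'~p}[k(X, X')].
        n = len(samples)
        total = sum(kernel(samples[i], samples[j]) for i in range(n) for j in range(n))
        return 0.5 * total / (n * n)

    def schoenberg_rao(xs, ys, kernel=cnd_kernel):
        # Jensen gap of the concave quadratic entropy at the mixture (p + q)/2,
        # which expands (up to a constant factor) to the energy-distance-like form
        # E[k(X, Y)] - 1/2 E[k(X, X')] - 1/2 E[k(Y, Y')]  with X ~ p, Y ~ q.
        cross = np.mean([kernel(x, y) for x in xs for y in ys])
        return cross - rao_quadratic_entropy(xs, kernel) - rao_quadratic_entropy(ys, kernel)

A quick usage check on two Gaussian samples:

    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, size=(200, 2))  # samples from p
    ys = rng.normal(2.0, 1.0, size=(200, 2))  # samples from q (shifted mean)
    print(schoenberg_rao(xs, ys))  # nonnegative; shrinks as the two samples overlap

Because the kernel is conditionally negative semi-definite, Schoenberg's classical result guarantees that a quantity of this form is a squared Hilbert metric, which is what makes such a construction geometry-aware while still being computable from samples alone.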
