Chain Rule Optimal Transport

We define a novel class of distances between statistical multivariate distributions by modeling an optimal transport problem on their marginals with respect to a ground distance defined on their conditionals. These new distances are metrics whenever the ground distance between the conditionals is a metric; they generalize both the Wasserstein distances between discrete measures and a recently introduced metric distance between statistical mixtures, and they provide an upper bound on jointly convex distances between statistical mixtures. By entropic regularization of the optimal transport, we obtain a fast differentiable Sinkhorn-type distance. We experimentally evaluate our new family of distances by quantifying the upper bounds of several jointly convex distances between statistical mixtures, and by proposing a novel efficient method to learn Gaussian mixture models (GMMs) by simplifying kernel density estimators with respect to our distance. In our experiments, our GMM learning technique improves significantly over the EM implementation of sklearn on the MNIST and Fashion-MNIST datasets.
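For two finite mixtures, the construction described above reduces to a discrete optimal transport problem between the mixture weights (the marginals), with a ground cost given by a distance between the components (the conditionals); its entropic regularization can be solved by Sinkhorn iterations. The snippet below is a minimal sketch, not the paper's reference implementation: it assumes Gaussian mixture components, uses the closed-form squared 2-Wasserstein distance between Gaussians as the ground cost, and relies on the POT library's ot.sinkhorn2 solver. All component parameters in the toy usage are illustrative placeholders.

```python
# Minimal sketch of a chain-rule optimal transport (CROT) style distance between
# two Gaussian mixtures: Sinkhorn OT on the mixture weights, with a ground cost
# matrix built from a distance between the Gaussian components.
import numpy as np
from scipy.linalg import sqrtm
import ot  # POT: Python Optimal Transport


def w2_gaussian_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2) (closed form)."""
    S2_half = sqrtm(S2)
    cross = sqrtm(S2_half @ S1 @ S2_half)
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross)))


def crot_sinkhorn(w_p, means_p, covs_p, w_q, means_q, covs_q, reg=1e-2):
    """Entropic-regularized CROT-style value: Sinkhorn OT on the weights
    with a component-wise ground cost matrix."""
    k, l = len(w_p), len(w_q)
    C = np.zeros((k, l))
    for i in range(k):
        for j in range(l):
            C[i, j] = w2_gaussian_sq(means_p[i], covs_p[i], means_q[j], covs_q[j])
    # Letting reg -> 0 (or calling ot.emd2 on the same cost matrix) recovers
    # the unregularized transport value.
    return ot.sinkhorn2(w_p, w_q, C, reg)


# Toy usage with two 2-component GMMs in R^2 (made-up parameters).
means_p = [np.zeros(2), np.ones(2)]
covs_p = [np.eye(2), 0.5 * np.eye(2)]
means_q = [np.array([0.5, 0.0]), np.array([2.0, 1.0])]
covs_q = [np.eye(2), np.eye(2)]
print(crot_sinkhorn(np.array([0.6, 0.4]), means_p, covs_p,
                    np.array([0.3, 0.7]), means_q, covs_q))
```

Any other metric between the conditionals (for example, a closed-form distance between exponential-family components) could be substituted for the ground cost without changing the structure of the computation.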
