Wasserstein Training of Boltzmann Machines

The Boltzmann machine provides a useful framework to learn highly complex, multimodal and multiscale data distributions that occur in the real world. The default method to learn its parameters consists of minimizing the Kullback-Leibler (KL) divergence from training samples to the Boltzmann model. We propose in this work a novel approach for Boltzmann training which assumes that a meaningful metric between observations is given. This metric can be represented by the Wasserstein distance between distributions, for which we derive a gradient with respect to the model parameters. Minimization of this new Wasserstein objective leads to generative models that are better when considering the metric and that have a cluster-like structure. We demonstrate the practical potential of these models for data completion and denoising, for which the metric between observations plays a crucial role.

[1]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[2]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[3]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[4]  B. Ripley,et al.  Robust Statistics , 2018, Wiley Series in Probability and Statistics.

[5]  C. Villani Optimal Transport: Old and New , 2008 .

[6]  Geoffrey E. Hinton,et al.  Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine , 2010, NIPS.

[7]  Nando de Freitas,et al.  Inductive Principles for Restricted Boltzmann Machine Learning , 2010, AISTATS.

[8]  Terrence J. Sejnowski,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cognitive Sciences.

[9]  Arnaud Doucet,et al.  Fast Computation of Wasserstein Barycenters , 2013, ICML.

[10]  F. Bassetti,et al.  On minimum Kantorovich distance estimators , 2006 .

[11]  Tapani Raiko,et al.  Enhanced Gradient for Training Restricted Boltzmann Machines , 2013, Neural Computation.

[12]  Klaus-Robert Müller,et al.  Deep Boltzmann Machines and the Centering Trick , 2012, Neural Networks: Tricks of the Trade.

[13]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[14]  Geoffrey E. Hinton,et al.  Robust Boltzmann Machines for recognition and denoising , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  C. Tomasi The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval , 1997 .

[16]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[17]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.