Wasserstein Training of Restricted Boltzmann Machines

Boltzmann machines are able to learn highly complex, multimodal, structured and multiscale real-world data distributions. Parameters of the model are usually learned by minimizing the Kullback-Leibler (KL) divergence from training samples to the learned model. We propose in this work a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known. This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.

[1]  Geoffrey E. Hinton,et al.  Robust Boltzmann Machines for recognition and denoising , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Geoffrey E. Hinton,et al.  Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine , 2010, NIPS.

[3]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[4]  Nando de Freitas,et al.  Inductive Principles for Restricted Boltzmann Machine Learning , 2010, AISTATS.

[5]  Arnaud Doucet,et al.  Fast Computation of Wasserstein Barycenters , 2013, ICML.

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  B. Ripley,et al.  Robust Statistics , 2018, Wiley Series in Probability and Statistics.

[8]  Carlo Tomasi,et al.  The Earth Mover’s Distance , 2001 .

[9]  Tapani Raiko,et al.  Enhanced Gradient for Training Restricted Boltzmann Machines , 2013, Neural Computation.

[10]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[11]  C. Tomasi The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval , 1997 .

[12]  Nicolas Courty,et al.  Optimal Transport for Domain Adaptation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Hossein Mobahi,et al.  Learning with a Wasserstein Loss , 2015, NIPS.

[14]  F. Bassetti,et al.  On minimum Kantorovich distance estimators , 2006 .

[15]  C. Villani Optimal Transport: Old and New , 2008 .

[16]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[17]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[18]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[19]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[20]  Klaus-Robert Müller,et al.  Deep Boltzmann Machines and the Centering Trick , 2012, Neural Networks: Tricks of the Trade.