Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines

This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for training Boltzmann Machines. Similar in spirit to the Hessian-Free method of Martens [14], our algorithm belongs to the family of truncated Newton methods and exploits an efficient matrix-vector product to avoid explicitly storing the natural gradient metric $L$. This metric is shown to be the expected second derivative of the log-partition function (under the model distribution), or equivalently, the covariance of the vector of partial derivatives of the energy function. We evaluate our method on the task of joint-training a 3-layer Deep Boltzmann Machine and show that MFNG does indeed converge faster per epoch than Stochastic Maximum Likelihood with centering, though its wall-clock performance is currently not competitive.
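The two ingredients described above can be sketched in code: since $L$ is the covariance of the per-sample energy gradients, the product $Lv$ can be computed from a matrix of sampled gradients without ever forming $L$, and a truncated conjugate-gradient solver then recovers the natural gradient direction from such products alone. The sketch below is a minimal illustration under assumed inputs (a matrix `G` of sampled energy gradients and a damping constant), not the paper's actual implementation.

```python
import numpy as np

def metric_vector_product(G, v):
    """Compute L @ v where L = Cov(g), estimated from samples.

    G: (n_samples, n_params) array whose rows are sampled partial
    derivatives of the energy function (hypothetical setup).
    Uses L v = E[g (g . v)] - E[g] (E[g] . v), so the
    n_params x n_params matrix L is never stored.
    """
    mu = G.mean(axis=0)
    Gv = G @ v                              # (n_samples,) projections g_i . v
    return (G.T @ Gv) / G.shape[0] - mu * (mu @ v)

def truncated_cg(mvp, b, max_iter=50, tol=1e-8, damping=1e-3):
    """Solve (L + damping * I) x = b by conjugate gradient,
    using only metric-vector products, in the spirit of
    truncated-Newton / Hessian-free methods."""
    x = np.zeros_like(b)
    r = b - (mvp(x) + damping * x)          # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = mvp(p) + damping * p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Natural gradient step: solve L x = grad for the plain gradient `grad`.
# natural_grad = truncated_cg(lambda v: metric_vector_product(G, v), grad)
```

The damping term plays the role of the usual Tikhonov regularization in truncated-Newton methods, keeping the system well-conditioned when the sampled metric is near-singular.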

[1]  M. Saunders, et al. Solution of Sparse Indefinite Systems of Linear Equations, 1975.

[2]  Shun-ichi Amari, et al. Differential-geometrical methods in statistics, 1985.

[3]  F. Götze. Differential-geometrical methods in statistics, by S.-I. Amari (Lecture Notes in Statistics) [book review], 1987.

[4]  Shun-ichi Amari, et al. Information geometry of Boltzmann machines, 1992, IEEE Trans. Neural Networks.

[5]  Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.

[6]  Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[7]  Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[8]  L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, 1999.

[9]  Radford M. Neal. Annealed importance sampling, 1998, Stat. Comput..

[10]  Miguel Á. Carreira-Perpiñán, et al. On Contrastive Divergence Learning, 2005, AISTATS.

[11]  Ruslan Salakhutdinov, et al. On the quantitative analysis of deep belief networks, 2008, ICML '08.

[12]  Tijmen Tieleman, et al. Training restricted Boltzmann machines using approximations to the likelihood gradient, 2008, ICML '08.

[13]  Geoffrey E. Hinton, et al. Deep Boltzmann Machines, 2009, AISTATS.

[14]  James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.

[15]  O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[16]  Klaus-Robert Müller, et al. Deep Boltzmann Machines and the Centering Trick, 2012, Neural Networks: Tricks of the Trade.

[17]  Nicol N. Schraudolph, et al. Centering Neural Network Gradient Factors, 1996, Neural Networks: Tricks of the Trade.

[18]  Tapani Raiko, et al. Deep Learning Made Easier by Linear Transformations in Perceptrons, 2012, AISTATS.

[19]  K. Müller, et al. Learning Feature Hierarchies with Centered Deep Boltzmann Machines, 2012, arXiv.