Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines

This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for training Boltzmann Machines. Similar in spirit to the Hessian-Free method of Martens [14], our algorithm belongs to the family of truncated Newton methods and exploits an efficient matrix-vector product to avoid explicitly storing the natural gradient metric $L$. This metric is shown to be the expected second derivative of the log-partition function (under the model distribution), or equivalently, the covariance of the vector of partial derivatives of the energy function. We evaluate our method on the task of joint-training a 3-layer Deep Boltzmann Machine and show that MFNG does indeed converge faster per epoch than Stochastic Maximum Likelihood with centering, though its wall-clock performance is currently not competitive.
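The two ingredients described above can be sketched in code: since $L$ is the covariance of the per-sample energy gradients, the product $Lv$ can be computed from a matrix of sampled gradients without ever forming $L$, and a truncated conjugate-gradient solver then recovers the natural gradient direction from such products alone. The sketch below is a minimal illustration under assumed inputs (a matrix `G` of sampled energy gradients and a damping constant), not the paper's actual implementation.

```python
import numpy as np

def metric_vector_product(G, v):
    """Compute L @ v where L = Cov(g), estimated from samples.

    G: (n_samples, n_params) array whose rows are sampled partial
    derivatives of the energy function (hypothetical setup).
    Uses L v = E[g (g . v)] - E[g] (E[g] . v), so the
    n_params x n_params matrix L is never stored.
    """
    mu = G.mean(axis=0)
    Gv = G @ v                              # (n_samples,) projections g_i . v
    return (G.T @ Gv) / G.shape[0] - mu * (mu @ v)

def truncated_cg(mvp, b, max_iter=50, tol=1e-8, damping=1e-3):
    """Solve (L + damping * I) x = b by conjugate gradient,
    using only metric-vector products, in the spirit of
    truncated-Newton / Hessian-free methods."""
    x = np.zeros_like(b)
    r = b - (mvp(x) + damping * x)          # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = mvp(p) + damping * p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Natural gradient step: solve L x = grad for the plain gradient `grad`.
# natural_grad = truncated_cg(lambda v: metric_vector_product(G, v), grad)
```

The damping term plays the role of the usual Tikhonov regularization in truncated-Newton methods, keeping the system well-conditioned when the sampled metric is near-singular.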

[1]  M. Saunders, et al. Solution of Sparse Indefinite Systems of Linear Equations, 1975.

[2]  Shun-ichi Amari, et al. Differential-geometrical methods in statistics, 1985.

[3]  F. Götze. Differential-geometrical methods in statistics, by S.-I. Amari (Lecture Notes in Statistics) [book review], 1987.

[4]  Shun-ichi Amari, et al. Information geometry of Boltzmann machines, 1992, IEEE Trans. Neural Networks.

[5]  Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.

[6]  Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[7]  Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[8]  L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, 1999.

[9]  Radford M. Neal. Annealed importance sampling, 1998, Stat. Comput..

[10]  Miguel Á. Carreira-Perpiñán, et al. On Contrastive Divergence Learning, 2005, AISTATS.

[11]  Ruslan Salakhutdinov, et al. On the quantitative analysis of deep belief networks, 2008, ICML '08.

[12]  Tijmen Tieleman, et al. Training restricted Boltzmann machines using approximations to the likelihood gradient, 2008, ICML '08.

[13]  Geoffrey E. Hinton, et al. Deep Boltzmann Machines, 2009, AISTATS.

[14]  James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.

[15]  O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[16]  Klaus-Robert Müller, et al. Deep Boltzmann Machines and the Centering Trick, 2012, Neural Networks: Tricks of the Trade.

[17]  Nicol N. Schraudolph, et al. Centering Neural Network Gradient Factors, 1996, Neural Networks: Tricks of the Trade.

[18]  Tapani Raiko, et al. Deep Learning Made Easier by Linear Transformations in Perceptrons, 2012, AISTATS.

[19]  K. Müller, et al. Learning Feature Hierarchies with Centered Deep Boltzmann Machines, 2012, arXiv.