How to Center Deep Boltzmann Machines

This work analyzes centered Restricted Boltzmann Machines (RBMs) and centered Deep Boltzmann Machines (DBMs), where centering is done by subtracting offset values from visible and hidden variables. We show analytically that (i) centered and normal Boltzmann Machines (BMs), and thus RBMs and DBMs, are different parameterizations of the same model class, such that any normal BM/RBM/DBM can be transformed into an equivalent centered BM/RBM/DBM and vice versa, and that this equivalence generalizes to artificial neural networks in general, (ii) the expected performance of centered binary BMs/RBMs/DBMs is invariant under a simultaneous flip of data and offsets, for any offset value in the range of zero to one, (iii) centering can be reformulated as a different update rule for normal BMs/RBMs/DBMs, and (iv) using the enhanced gradient is equivalent to setting the offset values to the average over model and data mean. Furthermore, we present numerical simulations suggesting that (i) optimal generative performance is achieved by subtracting mean values from visible as well as hidden variables, (ii) centered binary RBMs/DBMs reach significantly higher log-likelihood values than normal binary RBMs/DBMs, (iii) centering variants whose offsets depend on the model mean, like the enhanced gradient, suffer from severe divergence problems, (iv) learning is stabilized if an exponentially moving average over the batch means is used for the offset values instead of the current batch mean, which also prevents the enhanced gradient from severe divergence, (v) at a similar level of log-likelihood, centered binary RBMs/DBMs have smaller weights and larger bias parameters than normal binary RBMs/DBMs, (vi) centering leads to an update direction that is closer to the natural gradient, which is extremely efficient for training as we show for small binary RBMs, (vii) centering eliminates the need for greedy layer-wise pre-training of DBMs, which often even deteriorates the results independently of whether centering is used or not, and (viii) centering is also beneficial for autoencoders.
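To make the centering idea concrete, the following is a minimal sketch (not the authors' reference code) of a centered binary RBM trained with CD-1. It assumes binary units, offsets initialized to 0.5, an exponentially moving average over batch means for the offsets as suggested by result (iv), and the bias reparameterization that keeps the modeled distribution unchanged when the offsets move; the class and parameter names (`CenteredRBM`, `ema`, etc.) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class CenteredRBM:
    """Minimal centered binary RBM trained with CD-1 (illustrative sketch)."""

    def __init__(self, n_vis, n_hid, lr=0.01, ema=0.01):
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))
        self.b = np.zeros(n_vis)          # visible bias
        self.c = np.zeros(n_hid)          # hidden bias
        self.mu = np.full(n_vis, 0.5)     # visible offsets
        self.lam = np.full(n_hid, 0.5)    # hidden offsets
        self.lr, self.ema = lr, ema

    def hid_prob(self, v):
        # hidden activation uses the *centered* visible state (v - mu)
        return sigmoid((v - self.mu) @ self.W + self.c)

    def vis_prob(self, h):
        # visible activation uses the *centered* hidden state (h - lam)
        return sigmoid((h - self.lam) @ self.W.T + self.b)

    def cd1_step(self, v0):
        """One CD-1 parameter update on a batch of binary data v0."""
        # positive phase
        ph0 = self.hid_prob(v0)

        # negative phase: one step of block Gibbs sampling
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = self.vis_prob(h0)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = self.hid_prob(v1)

        # update offsets as an exponentially moving average over batch means
        # and reparameterize the biases so the modeled distribution is unchanged
        mu_new = (1 - self.ema) * self.mu + self.ema * v0.mean(axis=0)
        lam_new = (1 - self.ema) * self.lam + self.ema * ph0.mean(axis=0)
        self.b += self.W @ (lam_new - self.lam)
        self.c += self.W.T @ (mu_new - self.mu)
        self.mu, self.lam = mu_new, lam_new

        # centered gradient: correlations of centered states, data minus model
        n = v0.shape[0]
        dW = ((v0 - self.mu).T @ (ph0 - self.lam)
              - (v1 - self.mu).T @ (ph1 - self.lam)) / n
        db = (v0 - v1).mean(axis=0)
        dc = (ph0 - ph1).mean(axis=0)

        self.W += self.lr * dW
        self.b += self.lr * db
        self.c += self.lr * dc
```

Setting both offsets to zero recovers the normal binary RBM update, while data-mean offsets with the moving-average smoothing correspond to the stable centering variant discussed above.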
