Stochastic Spectral Descent for Discrete Graphical Models

Interest in deep probabilistic graphical models has increased in recent years, due to their state-of-the-art performance on many machine learning applications. Such models are typically trained with stochastic gradient methods, which can take a significant number of iterations to converge. Since the computational cost of gradient estimation is prohibitive even for modestly sized models, training becomes slow and practically usable models are kept small. In this paper we propose a new, largely tuning-free algorithm to address this problem. Our approach derives novel majorization bounds based on the Schatten-∞ norm. Intriguingly, the minimizers of these bounds can be interpreted as gradient steps in a non-Euclidean space; we therefore propose a stochastic gradient method in that non-Euclidean space. We provide simple conditions under which our algorithm is guaranteed to converge, and demonstrate empirically that it leads to dramatically faster training and improved predictive ability compared to stochastic gradient descent for both directed and undirected graphical models.
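To make the non-Euclidean gradient step concrete, the sketch below shows what a single update looks like when the quadratic majorizer is built from the Schatten-∞ norm: the closed-form minimizer replaces the gradient matrix with its "#"-transform (its nuclear norm times UVᵀ from the SVD). This is a minimal illustration under stated assumptions, not the paper's implementation; the majorization constant L, the helper names, and the stochastic_gradient placeholder are hypothetical and chosen for exposition.

```python
import numpy as np

def sharp_operator(grad):
    """Map a gradient matrix G to its '#' under the Schatten-inf norm:
    G# = ||G||_{S1} * U @ Vt, where G = U diag(s) Vt is the SVD.
    UVt maximizes <G, Y> over the unit Schatten-inf ball, and ||G||_{S1}
    (the nuclear norm, dual to Schatten-inf) supplies the scaling."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    return s.sum() * (U @ Vt)

def spectral_descent_step(W, grad, L):
    """One spectral-descent update: the closed-form minimizer of the
    majorizer f(W) + <grad, Y - W> + (L/2) * ||Y - W||_{S-inf}^2."""
    return W - (1.0 / L) * sharp_operator(grad)

# Hypothetical usage: W is a weight matrix of one layer and
# stochastic_gradient(W) returns a minibatch gradient estimate.
# W = spectral_descent_step(W, stochastic_gradient(W), L=10.0)
```

For large weight matrices a full SVD per step may be costly; a truncated or randomized SVD is the natural substitution, though whether that matches the authors' implementation is not established here.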
