Block Mean Approximation for Efficient Second Order Optimization

Advanced optimization algorithms such as Newton's method and AdaGrad benefit from second-order derivatives or second-order statistics to achieve better descent directions and faster convergence rates. At their heart, such algorithms need to compute the inverse or inverse square root of a matrix whose size is quadratic in the dimensionality of the search space. For high-dimensional search spaces, computing the matrix inverse or inverse square root becomes prohibitively expensive, which in turn calls for approximate methods. In this work, we propose a new matrix approximation method that divides a matrix into blocks and represents each block by one or two numbers. The method allows efficient computation of the matrix inverse and inverse square root. We apply our method to AdaGrad in training deep neural networks. Experiments show encouraging results compared to the diagonal approximation.
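The abstract does not spell out the per-block parameterization, so the following is only a minimal sketch of the general idea under one plausible assumption: a diagonal block of the AdaGrad second-moment matrix is summarized by two numbers, the mean of its diagonal entries (d) and the mean of its off-diagonal entries (c), giving the structured block (d - c) I + c 1 1^T. Such a matrix has only two distinct eigenvalues, so its inverse square root has a closed form and can be applied to a gradient in O(n) rather than via an O(n^3) dense factorization. The function names and the toy setup below are hypothetical, not the paper's implementation.

```python
# Sketch of a block mean approximation for one diagonal block (assumed
# parameterization, not necessarily the paper's exact formulation).
import numpy as np

def block_mean_parameters(B):
    """Return (d, c): mean diagonal and mean off-diagonal entry of a square block."""
    n = B.shape[0]
    d = np.trace(B) / n
    c = (B.sum() - np.trace(B)) / (n * (n - 1)) if n > 1 else 0.0
    return d, c

def structured_inv_sqrt(d, c, n):
    """Closed-form inverse square root of (d - c) I + c 1 1^T.

    Eigenvalues are (d - c) with multiplicity n - 1 and d + (n - 1) c with
    multiplicity 1, so the result is again of the form a I + b 1 1^T.
    """
    lam_perp = d - c                 # eigenvalue on the subspace orthogonal to the all-ones vector
    lam_one = d + (n - 1) * c        # eigenvalue along the all-ones vector
    assert lam_perp > 0 and lam_one > 0, "block approximation must be positive definite"
    a = lam_perp ** -0.5
    b = (lam_one ** -0.5 - a) / n
    return a, b

def apply_inv_sqrt(a, b, g):
    """Apply (a I + b 1 1^T) to a gradient vector g in O(n) time."""
    return a * g + b * g.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    # Simulated accumulated outer products for one parameter block, plus damping.
    grads = rng.normal(size=(50, n))
    G_block = grads.T @ grads / 50 + 1e-3 * np.eye(n)

    d, c = block_mean_parameters(G_block)
    a, b = structured_inv_sqrt(d, c, n)

    g = rng.normal(size=n)
    fast = apply_inv_sqrt(a, b, g)

    # Reference: dense inverse square root of the approximated block.
    B_hat = (d - c) * np.eye(n) + c * np.ones((n, n))
    w, V = np.linalg.eigh(B_hat)
    dense = V @ np.diag(w ** -0.5) @ V.T @ g
    print("max abs deviation:", np.abs(fast - dense).max())
```

In an AdaGrad-style update, such a structured inverse square root would precondition the gradient of each block in place of the usual element-wise rule; the diagonal approximation mentioned in the abstract corresponds to keeping only per-coordinate statistics, whereas the block mean approximation also captures an average of the cross terms within each block.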
