Periodic step-size adaptation in second-order gradient descent for single-pass on-line structured learning

It has been established that second-order stochastic gradient descent (SGD) can potentially achieve generalization performance as good as the empirical optimum in a single pass through the training examples. However, second-order SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive for structured prediction problems that usually involve a very high-dimensional feature space. This paper presents a new second-order SGD method, called Periodic Step-size Adaptation (PSA). PSA approximates the Jacobian matrix of the mapping function and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian, which proves to be simpler and more effective than approximating the Hessian directly in an on-line setting. We tested PSA on a wide variety of models and tasks, including large-scale sequence labeling with conditional random fields and large-scale classification with linear support vector machines and convolutional neural networks. Experimental results show that the single-pass performance of PSA is consistently very close to the empirical optimum.
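To make the idea concrete, the following Python sketch shows one possible realization of the scheme outlined above: per-coordinate step sizes that are adapted periodically from an on-line estimate of the Jacobian of the SGD update mapping, converted to a curvature estimate through the linear relation between the Jacobian and the Hessian. This is a minimal illustration only; the estimator, the update rule, and the hyper-parameters (eta0, period, eta_min, eta_max) are assumptions made for exposition, not the exact PSA formulas.

```python
import numpy as np

def psa_sgd(grad_fn, data_stream, dim, eta0=0.1, period=100,
            eta_min=1e-4, eta_max=1.0):
    """Single-pass SGD with periodic, componentwise step-size adaptation.

    Illustrative sketch: every `period` updates, the diagonal of the Jacobian
    J of the update mapping is estimated from ratios of successive parameter
    changes, converted to a curvature estimate via H ~ (I - J) / eta, and the
    step sizes are reset accordingly. All formulas here are assumptions, not
    the exact rules of the PSA paper.
    """
    w = np.zeros(dim)
    eta = np.full(dim, eta0)        # componentwise step sizes
    dw_prev = None
    j_diag = np.zeros(dim)          # accumulated estimate of diag(J)
    count = 0

    for t, example in enumerate(data_stream, start=1):
        g = grad_fn(w, example)     # stochastic gradient on one example
        w_new = w - eta * g         # fixed-point style update mapping M(w)
        dw = w_new - w

        if dw_prev is not None:
            # Ratio of successive changes approximates diag(J) componentwise.
            safe = np.abs(dw_prev) > 1e-12
            ratio = np.where(safe, dw / np.where(safe, dw_prev, 1.0), 0.0)
            j_diag += ratio
            count += 1

        dw_prev = dw
        w = w_new

        if t % period == 0 and count > 0:
            # Periodic adaptation: J ~ I - eta * H  =>  H ~ (I - J) / eta.
            j_hat = np.clip(j_diag / count, -0.99, 0.99)
            h_hat = (1.0 - j_hat) / eta
            eta = np.clip(1.0 / h_hat, eta_min, eta_max)
            j_diag[:] = 0.0
            count = 0

    return w
```

As a usage example under the same assumptions, grad_fn(w, (x, y)) could return (w @ x - y) * x for a linear model with squared loss, and data_stream could be any iterable of training examples visited exactly once.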
