Periodic Step Size Adaptation for Single Pass On-line Learning

It has been established that second-order stochastic gradient descent (2SGD) can potentially match the generalization performance of the empirical optimum in a single pass (i.e., one epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the SGD update mapping and exploits a linear relation between the Jacobian and the Hessian to approximate the Hessian periodically, achieving near-optimal results in experiments on a wide variety of models and tasks.
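The core idea can be illustrated with a simplified sketch: run plain SGD, and every few updates form a secant (finite-difference) estimate of the diagonal curvature from the change in the averaged gradient relative to the change in the weights, then shrink each coordinate's step size in proportion to that estimate. This is a hypothetical illustration of periodic per-coordinate step-size adaptation, not the paper's exact PSA update rule; the function name, logistic-loss objective, and adaptation formula are assumptions made for the example.

```python
import numpy as np

def psa_logreg_single_pass(X, y, period=20, eta0=0.5, eps=1e-8):
    """One pass of SGD on the logistic loss with periodic per-coordinate
    step-size adaptation (a simplified sketch of the PSA idea).

    Every `period` examples, a secant estimate of the diagonal curvature is
    formed from the change in the period-averaged gradient over the change
    in the weights, and each coordinate's step size is reduced where the
    estimated curvature is large.
    """
    n, d = X.shape
    w = np.zeros(d)
    eta = np.full(d, eta0)       # per-coordinate step sizes
    w_ckpt = w.copy()            # weights at the last checkpoint
    gbar_prev = None             # averaged gradient over the previous period
    g_acc = np.zeros(d)
    for t in range(n):
        xi, yi = X[t], y[t]
        m = np.clip(yi * xi.dot(w), -30.0, 30.0)  # margin, clipped for stability
        g = -yi * xi / (1.0 + np.exp(m))          # gradient of log(1 + exp(-m))
        g_acc += g
        w -= eta * g
        if (t + 1) % period == 0:
            gbar = g_acc / period
            if gbar_prev is not None:
                # secant curvature estimate: |change in mean gradient| / |change in weights|
                h = np.abs(gbar - gbar_prev) / (np.abs(w - w_ckpt) + eps)
                eta = eta0 / (1.0 + h)            # larger curvature -> smaller step
            gbar_prev, w_ckpt = gbar, w.copy()
            g_acc = np.zeros(d)
    return w
```

On linearly separable synthetic data, a single pass with this scheme typically reaches high training accuracy, reflecting the single-pass setting the paper targets; the actual PSA method additionally uses the Jacobian-Hessian relation to calibrate the adaptation.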
