A Convergence Analysis of Log-Linear Training

Log-linear models are widely used probability models in statistical pattern recognition, and interest in them has grown considerably in recent years. Typically, log-linear models are trained according to a convex criterion. Optimizing the model parameters is costly and therefore an important topic, in particular for large-scale applications. Many papers have compared different optimization algorithms empirically. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. Using our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach.
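
The following sketch is not taken from the paper; it is a minimal illustration, under assumed synthetic data and a binary log-linear model (logistic regression), of the kind of ill-conditioning discussed in the abstract: near-collinear features inflate the condition number of the Hessian of the convex training criterion, which in turn slows gradient-based training.

```python
# Minimal sketch (illustrative assumption, not code from the paper): condition
# number of the Hessian of the convex log-linear training criterion.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, D features; feature 1 nearly duplicates feature 0.
N, D = 500, 10
X = rng.normal(size=(N, D))
X[:, 1] = X[:, 0] + 1e-3 * rng.normal(size=N)
w_true = rng.normal(size=D)
y = (X @ w_true + 0.5 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary log-linear model p(y=1|x) = sigmoid(w^T x); its negative log-likelihood
# is convex in the parameters w.
w = np.zeros(D)
p = sigmoid(X @ w)

# Hessian of the negative log-likelihood at w: H = X^T diag(p * (1 - p)) X.
H = X.T @ (X * (p * (1.0 - p))[:, None])

# A large eigenvalue spread (condition number) of H indicates an ill-conditioned
# training problem and slow convergence of simple gradient-type optimizers.
eigvals = np.linalg.eigvalsh(H)
print("condition number of the Hessian:", eigvals[-1] / eigvals[0])
```

With the nearly collinear feature pair above, the printed condition number is orders of magnitude larger than for independent features, matching the intuition that convexity alone does not guarantee fast training.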
