A Comparison of Algorithms for Maximum Entropy Parameter Estimation

Conditional maximum entropy (ME) models provide a general-purpose machine learning technique that has been applied successfully to fields as diverse as computer vision and econometrics, and that is used for a wide variety of classification problems in natural language processing. However, the flexibility of ME models is not without cost. While parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are very large, and may contain many thousands of free parameters. In this paper, we consider a number of algorithms for estimating the parameters of ME models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods. Surprisingly, the widely used iterative scaling algorithms perform quite poorly in comparison to the others, and on all of the test problems a limited-memory variable metric algorithm outperformed the other choices.
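
The gradient of the conditional log-likelihood of an ME model is simply the difference between the empirical and model expectations of the feature values, which is what makes general-purpose gradient-based optimizers applicable. As a minimal sketch of the setup the paper found fastest, the following snippet fits a toy conditional ME model with L-BFGS (a limited-memory variable metric method) via SciPy; the random data, feature tensor shape, and function names are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of conditional maximum entropy
# parameter estimation with a limited-memory variable metric optimizer.
# The toy dataset and all names below are hypothetical, for illustration.
import numpy as np
from scipy.optimize import minimize

# Toy data: N contexts, each with K candidate classes and F features.
rng = np.random.default_rng(0)
N, K, F = 200, 3, 5
features = rng.normal(size=(N, K, F))   # f(x, y) for each context/class pair
labels = rng.integers(0, K, size=N)     # observed class for each context

def neg_log_likelihood(theta):
    """Negative conditional log-likelihood and its gradient.

    p(y|x) = exp(theta . f(x, y)) / sum_y' exp(theta . f(x, y'))
    grad of the *negative* LL = E_model[f] - E_empirical[f]
    """
    scores = features @ theta                      # (N, K) dot products
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # conditional distributions
    ll = np.log(probs[np.arange(N), labels]).sum()
    model_exp = np.einsum('nk,nkf->f', probs, features)
    empirical_exp = features[np.arange(N), labels].sum(axis=0)
    return -ll, model_exp - empirical_exp

# L-BFGS, the class of algorithm the paper found to outperform the others.
result = minimize(neg_log_likelihood, np.zeros(F), jac=True, method='L-BFGS-B')
print('converged:', result.success, 'final negative LL:', result.fun)
```

For contrast, generalized iterative scaling would instead update each parameter by a damped log-ratio of the empirical to model feature expectations; the paper's finding is that such updates converge far more slowly than the quasi-Newton steps taken above.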
