Exponential Priors for Maximum Entropy Models

Maximum entropy models are a common modeling technique, but prone to overfitting. We show that using an exponential distribution as a prior leads to bounded absolute discounting by a constant. We show that this prior is better motivated by the data than previous techniques such as a Gaussian prior, and often produces lower error rates. Exponential priors also lead to a simpler learning algorithm and to easier to understand behavior. Furthermore, exponential priors help explain the success of some previous smoothing techniques, and suggest simple variations that work better.

[1]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[2]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[3]  William I. Newman,et al.  Extension to the maximum entropy method , 1977, IEEE Trans. Inf. Theory.

[4]  Ronald Rosenfeld,et al.  Adaptive Statistical Language Modeling; A Maximum Entropy Approach , 1994 .

[5]  Raymond Lau,et al.  Adaptive statistical language modeling , 1994 .

[6]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[7]  Peter M. Williams,et al.  Bayesian Regularization and Pruning Using a Laplace Prior , 1995, Neural Computation.

[8]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[9]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[10]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[11]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Approach to Identifying Sentence Boundaries , 1997, ANLP.

[12]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Mitchell P. Marcus,et al.  Maximum entropy models for natural language ambiguity resolution , 1998 .

[14]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[15]  Ronald Rosenfeld,et al.  A survey of smoothing techniques for ME models , 2000, IEEE Trans. Speech Audio Process..

[16]  Michele Banko,et al.  Mitigating the Paucity of Data Problem , 2001 .

[17]  Joshua Goodman,et al.  Classes for fast maximum entropy training , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[18]  Joshua Goodman,et al.  Sequential Conditional Generalized Iterative Scaling , 2002, ACL.

[19]  David Heckerman,et al.  CFW: A Collaborative Filtering System Using Posteriors over Weights of Evidence , 2002, UAI.

[20]  James Theiler,et al.  Online Feature Selection using Grafting , 2003, ICML.