Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization

Data sparseness, or overfitting, is a serious problem in natural language processing (NLP) with machine learning methods. This is true even for the maximum entropy (ME) method, whose flexible modeling capability has alleviated data sparseness more successfully than other probabilistic models in many NLP tasks. In standard ME estimation, the model is estimated so that it exactly satisfies the equality constraints on feature expectations; however, exact satisfaction leads to undesirable overfitting, especially for sparse features, since constraints derived from a limited amount of training data are always uncertain. To control overfitting in ME estimation, we propose the use of box-type inequality constraints, under which the equalities may be violated up to predefined levels that reflect this uncertainty. The resulting models, inequality ME models, in effect perform regularized estimation with L1-norm penalties on bounded parameters. Most importantly, this regularized estimation makes the model parameters sparse, which can be regarded as automatic feature selection and is expected to further improve generalization performance. We evaluate the inequality ME models on text categorization datasets and demonstrate their advantages over standard ME estimation, the similarly motivated Gaussian MAP estimation of ME models, and support vector machines (SVMs), which are among the state-of-the-art methods for text categorization.
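To make the formulation concrete, the following is a minimal sketch of the box-constrained problem the abstract describes; the notation ($\tilde{p}$ for the empirical distribution, $f_i$ for the $i$-th feature, $A_i, B_i$ for the predefined violation widths) is assumed here for illustration rather than quoted from the paper. Instead of the equality constraints $E_{\tilde{p}}[f_i] = E_{p}[f_i]$ of standard ME estimation, the model maximizes the entropy $H(p)$ subject to

$$ -B_i \;\le\; E_{\tilde{p}}[f_i] - E_{p}[f_i] \;\le\; A_i, \qquad A_i, B_i > 0, \quad \text{for all } i. $$

By standard ME duality, each two-sided constraint contributes a pair of non-negative dual parameters $\alpha_i, \beta_i \ge 0$, so the fitted model remains log-linear, $p(y \mid x) \propto \exp\bigl(\sum_i (\alpha_i - \beta_i)\, f_i(x, y)\bigr)$, while the dual objective subtracts the linear penalty $\sum_i (A_i \alpha_i + B_i \beta_i)$ from the log-likelihood. Because this penalty is linear in bounded (non-negative) parameters, it acts like an L1-norm penalty, and by complementary slackness any feature whose constraints are inactive at the optimum receives $\alpha_i = \beta_i = 0$; this is the parameter sparsity, i.e., automatic feature selection, described above.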
