Mixtures of Conditional Maximum Entropy Models

Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classification problems to better handle the case where complex data distributions arise from a mixture of simpler underlying (latent) distributions. We develop a theoretical framework for characterizing data as a mixture of maximum entropy models. We formulate a maximum-likelihood interpretation of learning the mixture model and derive a generalized EM algorithm to solve the corresponding optimization problem. We present empirical results on a number of data sets showing that modeling the data as a mixture of latent maximum entropy models yields a significant improvement over the standard, single-component maximum entropy approach.
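
To make the setup concrete, below is a minimal sketch (not the authors' implementation) of a mixture of conditional maximum entropy models trained with a generalized EM loop. It assumes each component is a log-linear (softmax) classifier, uses NumPy, and takes a single gradient-ascent step per component as the partial M-step; the function names, the learning rate `lr`, and the plain feature representation are all illustrative choices, not details from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax along the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_mixture_maxent(X, y, n_components=2, n_classes=None,
                       em_iters=100, lr=0.5, seed=0):
    """Generalized EM for p(y|x) = sum_m pi[m] * p_m(y|x),
    where p_m(y|x) is a log-linear (softmax) classifier."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = n_classes if n_classes is not None else int(y.max()) + 1
    W = rng.normal(scale=0.01, size=(n_components, d, K))  # per-component weights
    pi = np.full(n_components, 1.0 / n_components)         # mixing proportions
    Y = np.eye(K)[y]                                       # one-hot labels, (n, K)

    for _ in range(em_iters):
        # Component predictions: p[i, m, k] = p_m(k | x_i).
        p = softmax(np.einsum('nd,mdk->nmk', X, W))
        # E-step: responsibilities gamma[i, m] proportional to
        # pi[m] * p_m(y_i | x_i).
        lik = np.take_along_axis(p, y[:, None, None], axis=2)[:, :, 0]
        gamma = pi * lik
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Generalized M-step: closed-form update for the mixing weights,
        # then one gradient-ascent step per component on the
        # responsibility-weighted conditional log-likelihood.
        pi = gamma.mean(axis=0)
        for m in range(n_components):
            grad = X.T @ (gamma[:, [m]] * (Y - p[:, m, :])) / n
            W[m] += lr * grad
    return pi, W
```

Because the M-step here only improves, rather than fully maximizes, the expected complete-data log-likelihood, this is a generalized EM procedure: each iteration is still guaranteed not to decrease the mixture log-likelihood. At prediction time the components are combined as p(y|x) = sum_m pi[m] * p_m(y|x).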
