Maximum Entropy Discrimination Markov Networks

The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. As a result, its unique advantages, such as dual sparseness and the kernel trick, cannot easily be conjoined with the merits of a probabilistic model, such as Bayesian regularization, model averaging, and the ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule, based on a point estimator of model coefficients, to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions over the model coefficients, it subsumes the well-known maximum margin Markov networks (M3N) as a special case, and leads to a model similar to an L1-regularized M3N that is simultaneously primal- and dual-sparse, as well as other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization-based M3N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of a Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov networks for structured input/output problems; the basic principle can be generalized to learning arbitrary graphical models, such as generative Bayesian networks or models with structured hidden variables. We discuss a number of theoretical properties of this approach, and show empirically that it outperforms a wide array of competing methods for structured input/output learning on synthetic data as well as real OCR and web data extraction data sets.
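
To make the learning paradigm concrete, the core program can be sketched as an entropic-regularization problem over distributions of Markov-network weights, with prediction by posterior averaging. The notation below (weight vector w, feature map f, structured loss Δℓ_i, slack ξ_i) is a plausible reconstruction of the standard max-margin Markov network setup, not a verbatim quotation of the paper's formulation:

% Structured MaxEnt discrimination: learn a distribution p(w) over
% Markov-network weights rather than a single point estimate.
\begin{align*}
\min_{p(\mathbf{w}),\,\boldsymbol{\xi}} \quad
  & \mathrm{KL}\!\left(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\right) + U(\boldsymbol{\xi}) \\
\text{s.t.} \quad
  & \int p(\mathbf{w})\,\big[\Delta F_i(\mathbf{y};\mathbf{w}) - \Delta\ell_i(\mathbf{y})\big]\, d\mathbf{w} \;\ge\; -\xi_i,
    \quad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}^i, \\
  & \xi_i \ge 0, \quad \forall i,
\end{align*}
where $\Delta F_i(\mathbf{y};\mathbf{w}) = \mathbf{w}^{\top}\big[\mathbf{f}(\mathbf{x}^i,\mathbf{y}^i) - \mathbf{f}(\mathbf{x}^i,\mathbf{y})\big]$ is the margin of the true labeling $\mathbf{y}^i$ over a competing labeling $\mathbf{y}$, enforced in expectation under $p(\mathbf{w})$. Prediction then averages over the learned distribution rather than using a single point estimate:
\[
  h(\mathbf{x}) = \arg\max_{\mathbf{y}} \int p(\mathbf{w})\, \mathbf{w}^{\top} \mathbf{f}(\mathbf{x},\mathbf{y})\, d\mathbf{w}.
\]
Under this reading, a Gaussian prior $p_0(\mathbf{w})$ recovers the M3N point-estimate solution, while a Laplace prior yields the simultaneously primal- and dual-sparse variant mentioned above.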
