Bethe Learning of Conditional Random Fields via MAP Decoding

Many machine learning tasks can be formulated in terms of predicting structured outputs. In frameworks such as the structured support vector machine (SVM-Struct) and the structured perceptron, discriminative functions are learned by iteratively applying efficient maximum a posteriori (MAP) decoding. However, maximum likelihood estimation (MLE) of probabilistic models over these same structured spaces requires computing partition functions, which is generally intractable. This paper presents a method for learning discrete exponential family models using the Bethe approximation to the MLE. Remarkably, this problem also reduces to iterative MAP decoding. This connection emerges by combining the Bethe approximation with a Frank-Wolfe (FW) algorithm on a convex dual objective that circumvents the intractable partition function. The result is a new single-loop algorithm, MLE-Struct, which is substantially more efficient than previous double-loop methods for approximate maximum likelihood estimation. Our algorithm outperforms existing methods in experiments involving image segmentation, matching problems from vision, and a new dataset of university roommate assignments.
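
To illustrate why a Frank-Wolfe scheme naturally reduces to repeated MAP-style calls, the following is a minimal generic sketch and not the MLE-Struct algorithm itself. It assumes a toy feasible set (the probability simplex, whose vertices are one-hot vectors), a toy quadratic objective, and the standard 2/(t+2) step size; the names `map_oracle`, `frank_wolfe`, and `target` are hypothetical. The point is only that each iteration needs a single linear maximization over vertices, which is the role MAP decoding plays in structured models.

```python
import numpy as np

def map_oracle(scores):
    """Toy stand-in for a MAP decoder: return the one-hot vertex
    of the simplex with the highest score."""
    vertex = np.zeros_like(scores)
    vertex[np.argmax(scores)] = 1.0
    return vertex

def frank_wolfe(grad, x0, num_iters=200):
    """Generic Frank-Wolfe loop over the convex hull of the oracle's vertices."""
    x = x0.copy()
    for t in range(num_iters):
        g = grad(x)
        # Linear minimization of <g, s> over vertices equals
        # maximizing the scores -g, i.e. one oracle ("decoding") call.
        s = map_oracle(-g)
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy objective (an assumption for illustration): 0.5 * ||x - target||^2.
target = np.array([0.2, 0.5, 0.3])
grad = lambda x: x - target
x = frank_wolfe(grad, np.array([1.0, 0.0, 0.0]), num_iters=2000)
```

Every iterate stays a convex combination of oracle outputs, so no projection step and no normalizing constant is needed inside the loop; only the oracle touches the combinatorial structure.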
