Discriminative Learning of Max-Sum Classifiers

The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov Random Field is a particular example of the max-sum classifier. Learning the parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating the parameters by the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, discriminative methods represented by the perceptron and Support Vector Machines, originally designed for binary linear classifiers, have been extended to learning some subclasses of the max-sum classifier. Besides max-sum classifiers with an acyclic neighbouring structure, discriminative learning has been shown to be possible even with an arbitrary neighbouring structure, provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) max-sum classifiers with supermodular quality functions, and (ii) max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier.
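
To make the prediction rule concrete, the following display is a minimal sketch of the max-sum classifier, written in illustrative notation (objects V, neighbouring pairs E, unary quality functions q_v, pairwise quality functions g_{vv'}) that may differ from the article's own:

    ŷ = argmax_{y ∈ Y^n} [ Σ_{v ∈ V} q_v(x, y_v) + Σ_{(v,v') ∈ E} g_{vv'}(y_v, y_{v'}) ]

For an acyclic neighbouring structure this maximization can be solved exactly by dynamic programming. The sketch below is a hypothetical Python illustration of the chain special case, not the article's algorithm; the array shapes (n objects, k labels) are assumptions of this example:

    import numpy as np

    def chain_max_sum(q, g):
        """Max-sum prediction on a chain: q is an (n, k) array of unary
        qualities, g an (n-1, k, k) array of pairwise qualities; returns
        the label tuple maximizing the total quality."""
        n, k = q.shape
        score = q[0].copy()                 # best quality of a prefix ending in each label
        back = np.zeros((n, k), dtype=int)  # argmax pointers for backtracking
        for v in range(1, n):
            # cand[a, b]: best prefix quality with y_{v-1} = a and y_v = b
            cand = score[:, None] + g[v - 1] + q[v][None, :]
            back[v] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        y = [int(score.argmax())]
        for v in range(n - 1, 0, -1):       # follow the pointers backwards
            y.append(int(back[v][y[-1]]))
        return list(reversed(y))

With cycles the maximization is NP-complete in general, which is what makes the two tractable subclasses studied in the article attractive: for supermodular quality functions the response can be computed in polynomial time via min-cut constructions, and for the second subclass the linear programming relaxation is exact.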
