Efficient Training of Structured SVMs via Soft Constraints

Structured output prediction is a powerful framework for jointly predicting interdependent output labels, and learning the parameters of structured predictors is a central task in machine learning applications. However, training such models from data is often computationally expensive. Several methods exploit the structure, or decomposition, of the model to obtain efficient training algorithms. In particular, methods based on linear programming relaxation, or dual decomposition, split the prediction task into multiple simpler prediction tasks and enforce agreement between overlapping predictions. In this work we observe that relaxing these agreement constraints and replacing them with soft constraints yields a much easier optimization problem. Based on this insight we propose an alternative training objective, analyze its theoretical properties, and derive an algorithm for its optimization. Our method, based on the Frank-Wolfe algorithm, achieves significant speedups over existing state-of-the-art methods without hurting prediction accuracy.
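
To make the algorithmic anchor concrete, below is a minimal generic sketch of the Frank-Wolfe (conditional gradient) method that the approach builds on. It minimizes a toy quadratic over the probability simplex and is purely illustrative: the paper uses a specialized block-coordinate variant applied to its soft-constrained structured-SVM objective, and the function name frank_wolfe_simplex, the toy objective, and all constants here are assumptions made for the example.

import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=100):
    """Generic Frank-Wolfe loop: minimize a smooth convex f over the simplex,
    given only its gradient. Illustrative sketch, not the paper's algorithm."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        # Linear minimization oracle over the simplex: the best vertex (one-hot).
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # Standard diminishing step size 2 / (t + 2).
        gamma = 2.0 / (t + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x

if __name__ == "__main__":
    # Toy objective: f(x) = 0.5 * ||x - b||^2, with gradient x - b.
    b = np.array([0.2, 0.5, 0.3])
    x_star = frank_wolfe_simplex(lambda x: x - b, np.ones(3) / 3.0)
    print(x_star)  # converges toward b, which already lies on the simplex

The key property exploited by such methods is that each iteration only requires a linear minimization oracle over the feasible set (here, picking a vertex of the simplex), which in the structured setting corresponds to a (loss-augmented) prediction call.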
