Learning Efficiently with Approximate Inference via Dual Losses

Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane.

[1]  Robert B. Ash,et al.  Information Theory , 2020, The SAGE International Encyclopedia of Mass Media and Society.

[2]  Claude Lemaréchal,et al.  Convergence of some algorithms for convex minimization , 1993, Math. Program..

[3]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[4]  Jason Weston,et al.  A kernel method for multi-labelled classification , 2001, NIPS.

[5]  Dimitri P. Bertsekas,et al.  Incremental Subgradient Methods for Nondifferentiable Optimization , 2001, SIAM J. Optim..

[6]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[7]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[8]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[9]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[10]  Jiebo Luo,et al.  Learning multi-label scene classification , 2004, Pattern Recognit..

[11]  Ben Taskar,et al.  Learning structured prediction models: a large margin approach , 2005, ICML.

[12]  Ben Taskar,et al.  Structured Prediction, Dual Extragradient and Bregman Projections , 2006, J. Mach. Learn. Res..

[13]  Vladimir Kolmogorov,et al.  Convergent Tree-Reweighted Message Passing for Energy Minimization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Thomas Hofmann,et al.  Predicting Structured Data (Neural Information Processing) , 2007 .

[15]  Gökhan BakIr,et al.  Predicting Structured Data , 2008 .

[16]  Fernando Pereira,et al.  Structured Learning with Approximate Inference , 2007, NIPS.

[17]  Nikos Komodakis,et al.  MRF Optimization via Dual Decomposition: Message-Passing Revisited , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[18]  Nathan Ratliff,et al.  Online) Subgradient Methods for Structured Prediction , 2007 .

[19]  Tommi S. Jaakkola,et al.  Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations , 2007, NIPS.

[20]  Dmitry M. Malioutov,et al.  Lagrangian Relaxation for MAP Estimation in Graphical Models , 2007, ArXiv.

[21]  Tomás Werner,et al.  A Linear Programming Approach to Max-Sum Problem: A Review , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Peter L. Bartlett,et al.  Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res..

[23]  Tommi S. Jaakkola,et al.  Tightening LP Relaxations for MAP using Message Passing , 2008, UAI.

[24]  Thorsten Joachims,et al.  Training structural SVMs when exact inference is intractable , 2008, ICML '08.

[25]  David Burshtein,et al.  Iterative Approximate Linear Programming Decoding of LDPC Codes With Linear Complexity , 2008, IEEE Transactions on Information Theory.

[26]  Shai Shalev-Shwartz,et al.  Theory and Practice of Support Vector Machines Optimization , 2009 .

[27]  Eric P. Xing,et al.  Polyhedral outer approximations with application to natural language parsing , 2009, ICML '09.

[28]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[29]  Tommi S. Jaakkola,et al.  Tree Block Coordinate Descent for MAP in Graphical Models , 2009, AISTATS.

[30]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[31]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[32]  K. Schittkowski,et al.  NONLINEAR PROGRAMMING , 2022 .