A Smoother Way to Train Structured Prediction Models

We present a framework for training structured prediction models by smoothing the inference algorithm they build upon. Smoothing overcomes the non-smoothness inherent to the maximum-margin structured prediction objective and paves the way for fast primal gradient-based optimization algorithms. We illustrate the framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm builds upon the stochastic variance-reduced gradient (SVRG) algorithm and blends an extrapolation scheme for acceleration with an adaptive smoothing scheme. We establish its worst-case global complexity bound and study several practical variants. Experiments on two real-world problems, named entity recognition and visual object localization, show that the framework turns efficient inference algorithms into large-scale optimization algorithms for structured prediction that achieve competitive performance on both tasks.
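As a hedged illustration of the smoothing idea described above, the sketch below applies entropy (log-sum-exp) smoothing to the loss-augmented max over a finite set of candidate outputs, such as those returned by a top-K inference oracle. The function and argument names (`smoothed_max`, `smoothed_structural_hinge`, `features`, `losses`, `mu`) are illustrative assumptions, not the paper's implementation; the point is that the smoothed surrogate has a (1/mu)-Lipschitz gradient, which is what makes fast primal gradient-based methods such as SVRG applicable.

```python
import numpy as np


def smoothed_max(scores, mu):
    """Entropy-smoothed maximum: mu * logsumexp(scores / mu).

    As mu -> 0 this recovers max(scores); the gradient is the softmax
    distribution over candidates and is (1/mu)-Lipschitz in the scores.
    """
    m = scores.max()
    expv = np.exp((scores - m) / mu)   # shift for numerical stability
    value = m + mu * np.log(expv.sum())
    grad = expv / expv.sum()           # softmax weights over candidates
    return value, grad


def smoothed_structural_hinge(w, features, losses, gold_feature, mu):
    """Smoothed loss-augmented hinge for one example (hypothetical interface).

    features     : (K, d) feature vectors phi(x, y) for K candidate outputs,
                   e.g. from a top-K inference oracle (include the gold output
                   among the candidates so the exact loss is nonnegative).
    losses       : (K,) task losses Delta(y, y_true) for the same candidates.
    gold_feature : (d,) feature vector phi(x, y_true).
    """
    augmented = features @ w + losses - gold_feature @ w
    value, p = smoothed_max(augmented, mu)
    grad_w = features.T @ p - gold_feature   # E_p[phi(x, y)] - phi(x, y_true)
    return value, grad_w


if __name__ == "__main__":
    # Toy usage with random candidates; in practice the candidates would come
    # from the inference algorithm the model builds upon.
    rng = np.random.default_rng(0)
    K, d = 5, 10
    w = rng.normal(size=d)
    features = rng.normal(size=(K, d))
    losses = rng.uniform(size=K)
    gold_feature = rng.normal(size=d)
    val, grad = smoothed_structural_hinge(w, features, losses, gold_feature, mu=0.1)
    print(val, grad.shape)
```

The per-example value and gradient returned by such a smoothed oracle can then be fed to an incremental variance-reduced method, with the smoothing parameter mu decreased adaptively over the course of optimization.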
