Differentiable Dynamic Programming for Structured Prediction and Attention

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.

[1]  Robert A. Sulanke,et al.  OBJECTS COUNTED BY THE CENTRAL DELANNOY NUMBERS , 2003 .

[2]  A. Fiacco A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of R " , 2009 .

[3]  L. Baum,et al.  Statistical Inference for Probabilistic Functions of Finite State Markov Chains , 1966 .

[4]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[5]  Claire Cardie,et al.  SparseMAP: Differentiable Sparse Structured Inference , 2018, ICML.

[6]  S. Verdú,et al.  Abstract dynamic programming models under commutativity conditions , 1987 .

[7]  David A. Smith,et al.  Minimum Risk Annealing for Training Log-Linear Models , 2006, ACL.

[8]  J. Zico Kolter,et al.  OptNet: Differentiable Optimization as a Layer in Neural Networks , 2017, ICML.

[9]  D. Bertsekas Control of uncertain systems with a set-membership description of the uncertainty , 1971 .

[10]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[11]  J. Danskin The Theory of Max-Min, with Applications , 1966 .

[12]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[13]  Vlad Niculae,et al.  A Regularized Framework for Sparse and Structured Neural Attention , 2017, NIPS.

[14]  Cyril Banderier,et al.  Why Delannoy numbers? , 2004, ArXiv.

[15]  Marc Teboulle,et al.  Smoothing and First Order Methods: A Unified Framework , 2012, SIAM J. Optim..

[16]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[17]  R Bellman,et al.  On the Theory of Dynamic Programming. , 1952, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Eduard H. Hovy,et al.  End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , 2016, ACL.

[19]  Alexander M. Rush,et al.  Structured Attention Networks , 2017, ICLR.

[20]  Vivien Seguy,et al.  Smooth and Sparse Optimal Transport , 2017, AISTATS.

[21]  Robert J. McEliece,et al.  The generalized distributive law , 2000, IEEE Trans. Inf. Theory.

[22]  Barak A. Pearlmutter Fast Exact Multiplication by the Hessian , 1994, Neural Computation.

[23]  Damien Garreau,et al.  Metric Learning for Temporal Sequence Alignment , 2014, NIPS.

[24]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[25]  Fu Jie Huang,et al.  A Tutorial on Energy-Based Learning , 2006 .

[26]  Andreas Krause,et al.  Differentiable Learning of Submodular Functions , 2017, NIPS.

[27]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[28]  Yoshua Bengio,et al.  Global training of document processing systems using graph transformer networks , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29]  C. Michelot A finite algorithm for finding the projection of a point onto the canonical simplex of ∝n , 1986 .

[30]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[31]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[32]  Veselin Stoyanov,et al.  Minimum-Risk Training of Approximate CRF-Based NLP Systems , 2012, NAACL.

[33]  Ofer Meshi,et al.  Smooth and Strong: MAP Inference with Linear Convergence , 2015, NIPS.

[34]  Marco Cuturi,et al.  Soft-DTW: a Differentiable Loss Function for Time-Series , 2017, ICML.

[35]  J. Moreau Proximité et dualité dans un espace hilbertien , 1965 .

[36]  Ramón Fernández Astudillo,et al.  From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification , 2016, ICML.

[37]  Jason Eisner,et al.  Inside-Outside and Forward-Backward Algorithms Are Just Backprop (tutorial paper) , 2016, SPNLP@EMNLP.

[38]  Gökhan BakIr,et al.  Predicting Structured Data , 2008 .

[39]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[40]  Bryan Pardo,et al.  Soundprism: An Online System for Score-Informed Source Separation of Music Audio , 2011, IEEE Journal of Selected Topics in Signal Processing.

[41]  Joan Bruna,et al.  Divide and Conquer Networks , 2016, ICLR.

[42]  Eszter Gselmann Entropy functions and functional equations , 2011 .

[43]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[44]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[45]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[46]  Yurii Nesterov,et al.  Smooth minimization of non-smooth functions , 2005, Math. Program..

[47]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[48]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[49]  Graham Neubig,et al.  A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models , 2017, AAAI.

[50]  Matthijs Douze,et al.  FastText.zip: Compressing text classification models , 2016, ArXiv.

[51]  T. Lindvall ON A ROUTING PROBLEM , 2004, Probability in the Engineering and Informational Sciences.