Paper information - Differentiable Dynamic Programming for Structured Prediction and Attention


Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows us to relax both the optimal value and the solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework: a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.
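To make the idea of smoothing the max operator concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one specific strongly convex regularizer, the negative entropy, for which the smoothed max becomes a scaled log-sum-exp and its gradient a softmax. The function names (`smoothed_max`, `smoothed_max_grad`, `smoothed_viterbi_value`), the score layout (`emit[t][j]`, `trans[i][j]`), and the temperature parameter `gamma` are illustrative assumptions.

```python
import math


def smoothed_max(values, gamma=1.0):
    """Entropy-regularized max: gamma * log(sum(exp(v / gamma))).

    Assumption: negative-entropy regularizer. The maximum is subtracted
    before exponentiating for numerical stability.
    """
    m = max(values)
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))


def smoothed_max_grad(values, gamma=1.0):
    """Gradient of smoothed_max w.r.t. its inputs: softmax(values / gamma)."""
    m = max(values)
    weights = [math.exp((v - m) / gamma) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_viterbi_value(emit, trans, gamma=1.0):
    """Smoothed value of the best state sequence in a linear chain.

    emit[t][j]  : hypothetical score of state j at time step t
    trans[i][j] : hypothetical score of moving from state i to state j
    The usual Viterbi recursion is kept, with max replaced by smoothed_max.
    """
    n_states = len(trans)
    value = list(emit[0])
    for t in range(1, len(emit)):
        value = [
            emit[t][j]
            + smoothed_max([value[i] + trans[i][j] for i in range(n_states)], gamma)
            for j in range(n_states)
        ]
    return smoothed_max(value, gamma)


if __name__ == "__main__":
    emit = [[1.0, 0.0], [0.0, 2.0], [1.5, 0.5]]
    trans = [[0.5, -1.0], [-0.5, 0.2]]
    for gamma in (1.0, 0.1, 0.001):
        print(gamma, smoothed_viterbi_value(emit, trans, gamma))
```

Under this choice of regularizer, the value approaches the ordinary Viterbi score as `gamma` shrinks towards zero, and at `gamma = 1` it coincides with the log-partition function of a linear-chain model with the same scores. Because every step is built from `smoothed_max`, the whole recursion is differentiable in `emit` and `trans`.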

Arthur Mensch | Mathieu Blondel
