Lagrangian relaxation for natural language decoding

The major success story of natural language processing over the last decade has been the development of high-accuracy statistical methods for a wide range of language applications. The availability of large textual data sets has made it possible to employ increasingly sophisticated statistical models to improve performance on language tasks. However, these more complex models often come at the cost of expanding the search space of the underlying decoding problem. In this dissertation, we focus on how to handle this challenge. In particular, we study decoding in large-scale statistical natural language systems. We aim to develop a formal understanding of the decoding problems behind these tasks and to present algorithms that extend beyond common heuristic approaches to yield optimality guarantees. The main tool we utilize, Lagrangian relaxation, is a classical idea from the field of combinatorial optimization. We begin the dissertation by giving general background on the method and describing common models in natural language processing. The body of the dissertation consists of six chapters. The first three chapters discuss relaxation methods for core natural language tasks: (1) examines the classical problem of parsing and part-of-speech tagging; (2) addresses the problem of language model composition in syntactic machine translation; (3) develops efficient algorithms for non-projective dependency parsing. The second set of chapters discusses methods that utilize relaxation in combination with other combinatorial techniques: (1) develops an exact beam-search algorithm for machine translation; (2) uses a parsing relaxation in a coarse-to-fine cascade. At the core of each chapter is a relaxation of a difficult combinatorial problem and the implementation of the resulting algorithm in a large-scale system.

Thesis Supervisor: Michael Collins
Title: Visiting Associate Professor, MIT; Vikram S. Pandit Professor of Computer Science, Columbia University
