Advances in discriminative dependency parsing

Achieving a greater understanding of natural language syntax and parsing is a critical step in producing useful natural language processing systems. In this thesis, we focus on the formalism of dependency grammar as it allows one to model important head-modifier relationships with a minimum of extraneous structure. Recent research in dependency parsing has highlighted the discriminative structured prediction framework (McDonald et al., 2005a; Carreras, 2007; Suzuki et al., 2009), which is characterized by two advantages: first, the availability of powerful discriminative learning algorithms like log-linear and max-margin models (Lafferty et al., 2001; Taskar et al., 2003), and second, the ability to use arbitrarily defined feature representations. This thesis explores three advances in the field of discriminative dependency parsing. First, we show that the classic Matrix-Tree Theorem (Kirchhoff, 1847; Tutte, 1984) can be applied to the problem of non-projective dependency parsing, enabling both log-linear and max-margin parameter estimation in this setting. Second, we present novel third-order dependency parsing algorithms that extend the amount of context available to discriminative parsers while retaining computational complexity equivalent to existing second-order parsers. Finally, we describe a simple but effective method for augmenting the features of a dependency parser with information derived from standard clustering algorithms; our semi-supervised approach is able to deliver consistent benefits regardless of the amount of available training data. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
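
As a rough illustration of the first contribution, the sketch below shows how the Matrix-Tree Theorem can turn a sum over all non-projective dependency trees into a single determinant. It is a minimal sketch under stated assumptions, not the thesis's implementation: it assumes the multi-root setting (the artificial root, node 0, may have several modifiers), it assumes non-negative edge weights such as exponentiated scores, and the names `determinant`, `partition_function`, and `brute_force_partition` are hypothetical helpers introduced here.

```python
from itertools import product


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Choose the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def partition_function(weights):
    """Sum over all dependency trees of the product of their edge weights.

    weights[h][m] is the weight of the edge from head h to modifier m.
    Node 0 is the artificial root; words are nodes 1..n.
    Assumption: multi-root trees are allowed.
    """
    n = len(weights) - 1
    lap = [[0.0] * n for _ in range(n)]
    for m in range(1, n + 1):
        for h in range(0, n + 1):
            if h == m:
                continue
            # Diagonal: total weight of all edges entering word m.
            lap[m - 1][m - 1] += weights[h][m]
            # Off-diagonal: negated weight of the edge h -> m (non-root h).
            if h >= 1:
                lap[h - 1][m - 1] -= weights[h][m]
    return determinant(lap)


def brute_force_partition(weights):
    """Same quantity by explicit enumeration; only feasible for tiny n."""
    n = len(weights) - 1
    total = 0.0
    for heads in product(range(n + 1), repeat=n):
        is_tree = True
        for m in range(1, n + 1):
            seen = set()
            node = m
            # Follow head pointers; revisiting a node means a cycle.
            while node != 0:
                if node in seen:
                    is_tree = False
                    break
                seen.add(node)
                node = heads[node - 1]
            if not is_tree:
                break
        if is_tree:
            score = 1.0
            for m in range(1, n + 1):
                score *= weights[heads[m - 1]][m]
            total += score
    return total


if __name__ == "__main__":
    uniform = [[1.0] * 4 for _ in range(4)]  # three words plus the root
    print(partition_function(uniform), brute_force_partition(uniform))
```

With uniform weights and three words, both functions should agree on a value of about 16, which matches Cayley's count of labeled trees on four nodes. The determinant costs cubic time in the sentence length, whereas the enumeration grows exponentially, which is why the determinant form is attractive for log-linear training.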

[1]  John Cocke,et al.  Programming languages and their compilers: Preliminary notes , 1969 .

[2]  Qun Liu,et al.  Forest-Based Translation , 2008, ACL.

[3]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[4]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[5]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[6]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[7]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[8]  Jason Eisner,et al.  Bilexical Grammars and their Cubic-Time Parsing Algorithms , 2000 .

[9]  Ben Taskar,et al.  Exponentiated Gradient Algorithms for Large-margin Structured Classification , 2004, NIPS.

[10]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[11]  Noah A. Smith,et al.  Contrastive Estimation: Training Log-Linear Models on Unlabeled Data , 2005, ACL.

[12]  W. T. Tutte Graph Theory , 1984 .

[13]  Sabine Buchholz,et al.  CoNLL-X Shared Task on Multilingual Dependency Parsing , 2006, CoNLL.

[14]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2007, ICML '07.

[15]  Liliane Haegeman,et al.  Introduction to Government and Binding Theory , 1991 .

[16]  Daniel H. Younger,et al.  Recognition and Parsing of Context-Free Languages in Time n^3 , 1967, Inf. Control.

[17]  Joakim Nivre,et al.  Memory-Based Dependency Parsing , 2004, CoNLL.

[18]  David Haussler,et al.  Probabilistic kernel regression models , 1999, AISTATS.

[19]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[20]  Giuseppe Attardi,et al.  Experiments with a Multilanguage Non-Projective Dependency Parser , 2006, CoNLL.

[21]  Xavier Carreras,et al.  Exponentiated gradient algorithms for log-linear structured prediction , 2007, ICML '07.

[22]  Slav Petrov,et al.  Products of Random Latent Variable Grammars , 2010, NAACL.

[23]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[24]  Mark Johnson,et al.  Estimators for Stochastic “Unification-Based” Grammars , 1999, ACL.

[25]  Alexander Clark,et al.  Inducing Syntactic Categories by Context Distribution Clustering , 2000, CoNLL/LLL.

[26]  Jan Hajic,et al.  Prague Arabic Dependency Treebank: Development in Data and Tools , 2004 .

[27]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[28]  Léon Bottou,et al.  Stochastic Learning , 2003, Advanced Lectures on Machine Learning.

[29]  Eugene Charniak,et al.  Effective Self-Training for Parsing , 2006, NAACL.

[30]  Michael Collins,et al.  Hidden-Variable Models for Discriminative Reranking , 2005, HLT.

[31]  Koby Crammer,et al.  Online Classification on a Budget , 2003, NIPS.

[32]  Ivan Titov,et al.  Constituent Parsing with Incremental Sigmoid Belief Networks , 2007, ACL.

[33]  Michael Collins,et al.  Discriminative Reranking for Natural Language Parsing , 2000, CL.

[34]  S. Chopra On the spanning tree polyhedron , 1989 .

[35]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[36]  Noam Chomsky,et al.  Three models for the description of language , 1956, IRE Trans. Inf. Theory.

[37]  G. Kirchhoff Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird , 1847 .

[38]  Manfred K. Warmuth,et al.  Relative Loss Bounds for Multidimensional Regression Problems , 1997, Machine Learning.

[39]  Saso Dzeroski,et al.  Towards a Slovene Dependency Treebank , 2006, LREC.

[40]  Joakim Nivre,et al.  Pseudo-Projective Dependency Parsing , 2005, ACL.

[41]  Albert B. Novikoff,et al.  On Convergence Proofs for Perceptrons , 1963 .

[42]  Keith Hall,et al.  Corrective Modeling for Non-Projective Dependency Parsing , 2005, IWPT.

[43]  Mark Johnson,et al.  PCFG Models of Linguistic Tree Representations , 1998, CL.

[44]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[45]  Jun'ichi Tsujii,et al.  Probabilistic CFG with Latent Annotations , 2005, ACL.

[46]  Scott Miller,et al.  Name Tagging with Word Clusters and Discriminative Training , 2004, NAACL.

[47]  Fernando Pereira,et al.  Non-Projective Dependency Parsing using Spanning Tree Algorithms , 2005, HLT.

[48]  Dale Schuurmans,et al.  Strictly Lexical Dependency Parsing , 2005, IWPT.

[49]  Yuji Matsumoto,et al.  Statistical Dependency Analysis with Support Vector Machines , 2003, IWPT.

[50]  Jan Hajic,et al.  The Prague Dependency Treebank , 2003 .

[51]  Gunnar Rätsch,et al.  Advanced Lectures on Machine Learning , 2004, Lecture Notes in Computer Science.

[52]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[53]  Gideon S. Mann,et al.  Distributed Training Strategies for the Structured Perceptron , 2010, NAACL.

[54]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res.

[55]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[56]  Richard Johansson,et al.  The CoNLL 2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies , 2008, CoNLL.

[57]  Percy Liang,et al.  Semi-Supervised Learning for Natural Language , 2005 .

[58]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[59]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[60]  Mark A. Paskin,et al.  Cubic-time Parsing and Learning Algorithms for Grammatical Bigram , 2001 .

[61]  Wei Li,et al.  Semi-Supervised Sequence Modeling with Syntactic Topic Models , 2005, AAAI.

[62]  Noah A. Smith,et al.  Computationally Efficient M-Estimation of Log-Linear Structure Models , 2007, ACL.

[63]  Xavier Carreras,et al.  Simple Semi-supervised Dependency Parsing , 2008, ACL.

[64]  Yuji Matsumoto,et al.  Machine Learning-based Dependency Analyzer for Chinese , 2005, J. Chin. Lang. Comput.

[65]  Giorgio Satta,et al.  Efficient Parsing for Bilexical Context-Free Grammars and Head Automaton Grammars , 1999, ACL.

[66]  Xavier Carreras,et al.  Structured Prediction Models via the Matrix-Tree Theorem , 2007, EMNLP.

[67]  Jason Eisner,et al.  Three New Probabilistic Models for Dependency Parsing: An Exploration , 1996, COLING.

[68]  Noam Chomsky,et al.  Aspects of the Theory of Syntax , 1965 .

[69]  Frank Harary,et al.  Graph Theory , 2016 .

[70]  Xavier Carreras,et al.  Experiments with a Higher-Order Projective Dependency Parser , 2007, EMNLP.

[71]  Liang Huang,et al.  Forest Reranking: Discriminative Parsing with Non-Local Features , 2008, ACL.

[72]  Gertjan van Noord,et al.  The Alpino Dependency Treebank , 2001, CLIN.

[73]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[74]  Michael Collins,et al.  Efficient Third-Order Dependency Parsers , 2010, ACL.

[75]  Dilek Z. Hakkani-Tür,et al.  Building a Turkish Treebank , 2003 .

[76]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[77]  F. Rosenblatt  The perceptron: a probabilistic model for information storage and organization in the brain , 1958, Psychological Review.

[78]  Tong Zhang,et al.  Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[79]  Michael Collins,et al.  A New Statistical Parser Based on Bigram Lexical Dependencies , 1996, ACL.

[80]  D. Huffman A Method for the Construction of Minimum-Redundancy Codes , 1952 .

[81]  Michael Collins,et al.  A Statistical Parser for Czech , 1999, ACL.

[82]  Montserrat Civit Torruella,et al.  Design Principles for a Spanish Treebank , 2002 .

[83]  John D. Lafferty,et al.  Decision Tree Parsing using a Hidden Derivation Model , 1994, HLT.

[84]  Sebastian Riedel,et al.  The CoNLL 2007 Shared Task on Dependency Parsing , 2007, EMNLP.

[85]  Xavier Carreras,et al.  TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing , 2008, CoNLL.

[86]  Tadao Kasami,et al.  An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages , 1965 .

[87]  Xavier Carreras,et al.  An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing , 2009, EMNLP.

[88]  Jun Suzuki,et al.  Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data , 2008, ACL.

[89]  Stefan Riezler,et al.  Speed and Accuracy in Shallow and Deep Stochastic Parsing , 2004, NAACL.

[90]  A. Hasman,et al.  Probabilistic reasoning in intelligent systems: Networks of plausible inference , 1991 .

[91]  Fernando Pereira,et al.  Online Learning of Approximate Dependency Parsing Algorithms , 2006, EACL.

[92]  Koby Crammer,et al.  Online Large-Margin Training of Dependency Parsers , 2005, ACL.

[93]  David A. McAllester On the complexity analysis of static analyses , 2002, JACM.

[94]  Fernando Pereira,et al.  Multilingual Dependency Analysis with a Two-Stage Discriminative Parser , 2006, CoNLL.

[95]  Dan Klein,et al.  Improved Inference for Unlexicalized Parsing , 2007, NAACL.

[96]  Christopher D. Manning,et al.  The Infinite Tree , 2007, ACL.

[97]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[98]  Giorgio Satta,et al.  On the Complexity of Non-Projective Data-Driven Dependency Parsing , 2007, IWPT.

[99]  J. Baker Trainable grammars for speech recognition , 1979 .

[100]  Peter L. Bartlett,et al.  Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res.

[101]  Eugene Charniak,et al.  Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking , 2005, ACL.

[102]  Richard Johansson,et al.  The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages , 2009, CoNLL Shared Task.

[103]  Noah A. Smith,et al.  Probabilistic Models of Nonprojective Dependency Trees , 2007, EMNLP.

[104]  Michael Collins,et al.  Three Generative, Lexicalised Models for Statistical Parsing , 1997, ACL.

[105]  Yoav Freund,et al.  Large Margin Classification Using the Perceptron Algorithm , 1998, COLT.

[106]  Eugene Charniak,et al.  Statistical Parsing with a Context-Free Grammar and Word Statistics , 1997, AAAI/IAAI.

[107]  Fernando Pereira,et al.  Discriminative learning and spanning tree algorithms for dependency parsing , 2006 .

[108]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[109]  Y. Singer,et al.  Ultraconservative online algorithms for multiclass problems , 2003 .

[110]  Joakim Nivre,et al.  Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines , 2006, CoNLL.

[111]  Manfred K. Warmuth,et al.  Exponentiated Gradient Versus Gradient Descent for Linear Predictors , 1997, Inf. Comput.

[112]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[113]  Mark Johnson,et al.  Joint and Conditional Estimation of Tagging and Parsing Models , 2001, ACL.

[114]  Xavier Carreras,et al.  Non-Projective Parsing for Statistical Machine Translation , 2009, EMNLP.

[115]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[116]  Joakim Nivre,et al.  Characterizing the Errors of Data-Driven Dependency Parsers , 2007 .

[117]  Christopher D. Manning,et al.  Efficient, Feature-based, Conditional Random Field Parsing , 2008, ACL.

[118]  Dan Klein,et al.  Learning Accurate, Compact, and Interpretable Tree Annotation , 2006, ACL.

[119]  John D. Lafferty,et al.  Boosting and Maximum Likelihood for Exponential Models , 2001, NIPS.

[120]  Jinxi Xu,et al.  A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model , 2008, ACL.

[121]  David Yarowsky,et al.  One Sense Per Discourse , 1992, HLT.