Parsing Models for Identifying Multiword Expressions

Multiword expressions lie at the syntax/semantics interface and have motivated alternative theories of syntax like Construction Grammar. Until now, however, syntactic analysis and multiword expression identification have been modeled separately in natural language processing. We develop two structured prediction models for joint parsing and multiword expression identification. The first is based on context-free grammars and the second uses tree substitution grammars, a formalism that can store larger syntactic fragments. Our experiments show that both models can identify multiword expressions with much higher accuracy than a state-of-the-art system based on word co-occurrence statistics.

We experiment with Arabic and French, which both have pervasive multiword expressions. Relative to English, they also have richer morphology, which induces lexical sparsity in finite corpora. To combat this sparsity, we develop a simple factored lexical representation for the context-free parsing model. Morphological analyses are automatically transformed into rich feature tags that are scored jointly with lexical items. This technique, which we call a factored lexicon, improves both standard parsing and multiword expression identification accuracy.
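The factored lexicon idea can be illustrated with a minimal sketch. It is not the model from the paper: the class name `FactoredLexicon`, the helper `feature_tag`, the add-alpha smoothing, and the choice to score a word and its feature tag as two independent emissions given the part-of-speech tag are all assumptions made for illustration. The sketch only shows how a morphological analysis might be flattened into a feature tag and scored together with the word form, so that an unseen word with a familiar analysis is not treated as entirely unknown.

```python
import math
from collections import Counter


def feature_tag(analysis):
    """Flatten a morphological analysis (a dict) into one tag string.

    Keys are sorted so that the same analysis always yields the same tag.
    """
    return "|".join(f"{key}={value}" for key, value in sorted(analysis.items()))


class FactoredLexicon:
    """Hypothetical factored lexicon: word and feature tag are scored together.

    Assumption: log p(word, feats | pos) is approximated by
    log p(word | pos) + log p(feats | pos), each with add-alpha smoothing.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = Counter()  # (pos, word) -> count
        self.feat_counts = Counter()  # (pos, feature tag) -> count
        self.pos_counts = Counter()   # pos -> count
        self.words = set()
        self.feats = set()

    def observe(self, pos, word, analysis):
        """Record one training token with its POS tag and analysis."""
        feat = feature_tag(analysis)
        self.word_counts[(pos, word)] += 1
        self.feat_counts[(pos, feat)] += 1
        self.pos_counts[pos] += 1
        self.words.add(word)
        self.feats.add(feat)

    def _log_prob(self, count, total, vocab_size):
        # One extra vocabulary slot is reserved for unseen events.
        return math.log((count + self.alpha) / (total + self.alpha * (vocab_size + 1)))

    def score(self, pos, word, analysis):
        """Joint log score of a word form and its feature tag under a POS tag."""
        feat = feature_tag(analysis)
        total = self.pos_counts[pos]
        word_score = self._log_prob(self.word_counts[(pos, word)], total, len(self.words))
        feat_score = self._log_prob(self.feat_counts[(pos, feat)], total, len(self.feats))
        return word_score + feat_score


lexicon = FactoredLexicon()
lexicon.observe("N", "maisons", {"gen": "f", "num": "pl"})
lexicon.observe("N", "chats", {"gen": "m", "num": "pl"})

# "tables" was never observed, but its feminine plural analysis was.
seen_feats = lexicon.score("N", "tables", {"gen": "f", "num": "pl"})
unseen_feats = lexicon.score("N", "tables", {"gen": "f", "num": "sg"})
```

In this toy setup the unseen word with a previously observed feature tag receives a higher score than the same word with an unobserved one, which is the sparsity-reducing effect the factored representation is meant to provide.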

Christopher D. Manning, et al. Parsing Models for Identifying Multiword Expressions, 2013.
