Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French

Multiword expressions (MWEs), a known nuisance for both linguistics and NLP, blur the lines between syntax and semantics. Previous work on MWE identification has relied primarily on surface statistics, which perform poorly for longer MWEs and cannot model discontinuous expressions. To address these problems, we show that even the simplest parsing models can effectively identify MWEs of arbitrary length, and that Tree Substitution Grammars achieve the best results. Our experiments show an absolute F1 improvement of 36.4% for French over an n-gram surface statistics baseline, currently the predominant method for MWE identification. Our models are useful for several NLP tasks in which MWE pre-grouping has improved accuracy.
