Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut

We present a novel representation, evaluation measure, and supervised models for the task of identifying the multiword expressions (MWEs) in a sentence, resulting in a lexical semantic segmentation. Our approach generalizes a standard chunking representation to encode MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich discriminative models. Experiments on a new dataset of English web text offer the first linguistically-driven evaluation of MWE identification with truly heterogeneous expression types. Our statistical sequence model greatly outperforms a lookup-based segmentation procedure, achieving nearly 60% F1 for MWE identification.
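
To make the idea of a chunking representation extended with gaps more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a six-tag scheme that the abstract itself does not spell out: uppercase `B`, `I`, `O` behave as in standard chunking, while lowercase `b`, `i`, `o` mark tokens that fall inside the gap of an open expression. The tag names, the function `decode_gappy_tags`, and the example sentence are all illustrative assumptions.

```python
from typing import List


def decode_gappy_tags(tags: List[str]) -> List[List[int]]:
    """Group token indices into multiword expressions (hypothetical scheme).

    Uppercase B/I/O act as in ordinary chunking. Lowercase b/i/o label
    tokens inside the gap of an expression that is still open, so a later
    uppercase I can resume that outer expression.
    """
    groups: List[List[int]] = []
    outer: List[int] = []  # expression open at the top level
    inner: List[int] = []  # expression open inside a gap

    def close(group: List[int]) -> None:
        # Keep only groups with at least two tokens, then reset the buffer.
        if len(group) > 1:
            groups.append(list(group))
        group.clear()

    for idx, tag in enumerate(tags):
        if tag == "B":  # start a new top-level expression
            close(inner)
            close(outer)
            outer.append(idx)
        elif tag == "I":  # continue the top-level expression, possibly after a gap
            if not outer:
                raise ValueError(f"'I' at position {idx} has no open expression")
            close(inner)
            outer.append(idx)
        elif tag == "O":  # token outside any expression
            close(inner)
            close(outer)
        elif tag == "b":  # start an expression nested in a gap
            if not outer:
                raise ValueError(f"'b' at position {idx} is not inside a gap")
            close(inner)
            inner.append(idx)
        elif tag == "i":  # continue the nested expression
            if not inner:
                raise ValueError(f"'i' at position {idx} has no open nested expression")
            inner.append(idx)
        elif tag == "o":  # plain token inside a gap
            if not outer:
                raise ValueError(f"'o' at position {idx} is not inside a gap")
            close(inner)
        else:
            raise ValueError(f"unknown tag {tag!r} at position {idx}")

    close(inner)
    close(outer)
    return sorted(groups)


# Illustrative sentence: "She looked the phone number up"
tokens = ["She", "looked", "the", "phone", "number", "up"]
tags = ["O", "B", "o", "b", "i", "I"]
for group in decode_gappy_tags(tags):
    print([tokens[i] for i in group])
```

Under these assumptions, "looked ... up" is recovered as one expression with a gap and "phone number" as a contiguous expression inside that gap. Because each token still receives exactly one tag, ordinary first-order sequence tagging machinery applies unchanged.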
