Identification and Lexical Representation of Multiword Expressions

This paper addresses two central problems: (i) the lack of large, rich, formalised lexicons of multi-word expressions (MWEs) for use in Natural Language Processing (NLP), and (ii) the lack of adequate methods and tools for extending the MWE lexicon of an NLP system from a text corpus in a maximally automated manner. The paper describes innovative methods and tools for the automatic identification and lexical representation of MWEs. In addition, it describes a 5,000-entry corpus-based MWE lexical database for Dutch developed using these methods. The database has been externally validated, and its usability has been evaluated in NLP systems for Dutch. It fills a gap in existing lexical resources for Dutch. The generic methods and tools for MWE identification and lexical representation focus on Dutch, but they are largely language-independent and can also be applied to other languages and new domains, beyond this project. The research results and data described in this paper contribute directly to strengthening the digital infrastructure for Dutch.
