mwetoolkit: a Framework for Multiword Expression Identification

This paper presents the Multiword Expression Toolkit (mwetoolkit), an environment for type- and language-independent identification of multiword expressions (MWEs) from corpora. The mwetoolkit produces a targeted list of MWE candidates, extracted and filtered according to user-defined criteria and a set of standard statistical association measures. To generate corpus counts, the toolkit provides both a corpus indexing facility and a tool for integration with web search engines; for evaluation, it provides validation and annotation facilities. When annotated data is available, the mwetoolkit can also be integrated with a machine learning tool to create and apply supervised MWE extraction models. In our experiment, the mwetoolkit was tested and evaluated on MWE extraction in the biomedical domain. Our preliminary results show that the toolkit performs better than other approaches, especially in terms of recall. This first version can also be extended in several ways to improve the quality of the results.
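
To make the filtering step concrete, the following is a minimal sketch of how candidates might be scored with two standard association measures (pointwise mutual information and the Dice coefficient) and then filtered by thresholds. It is not the mwetoolkit's actual interface: the function names `bigram_association` and `filter_candidates`, the restriction to adjacent word pairs, and the default thresholds are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Score every adjacent word pair in `tokens` (hypothetical helper).

    Returns {(w1, w2): {"count": ..., "pmi": ..., "dice": ...}}.
    Probabilities are plain relative frequencies over this token list.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = max(n - 1, 1)  # number of adjacent pairs in the list

    scores = {}
    for (w1, w2), c12 in bigrams.items():
        p12 = c12 / n_bigrams
        p1 = unigrams[w1] / n
        p2 = unigrams[w2] / n
        scores[(w1, w2)] = {
            "count": c12,
            # PMI: log ratio of observed to expected co-occurrence
            "pmi": math.log2(p12 / (p1 * p2)),
            # Dice: pair count relative to the two word counts
            "dice": 2 * c12 / (unigrams[w1] + unigrams[w2]),
        }
    return scores


def filter_candidates(scores, min_count=2, min_pmi=0.0):
    """Keep pairs above both thresholds, sorted by decreasing PMI.

    The threshold values are illustrative assumptions, not recommended settings.
    """
    kept = [
        (pair, s)
        for pair, s in scores.items()
        if s["count"] >= min_count and s["pmi"] >= min_pmi
    ]
    return sorted(kept, key=lambda item: item[1]["pmi"], reverse=True)
```

Calling `filter_candidates(bigram_association(tokens))` on a tokenized corpus returns the surviving pairs with their scores. In practice the counts could come from an indexed corpus or from web search hit counts, as the abstract describes, rather than from an in-memory token list.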
