A Hybrid Approach for Multiword Expression Identification

Considerable attention has been given to the identification and treatment of Multiword Expressions (MWEs) in NLP tasks such as parsing and generation, with the aim of improving the quality of results. Statistical methods have often been employed for MWE identification as an inexpensive, language-independent way of finding co-occurrence patterns. More linguistically motivated methods, which employ information such as POS filters and lexical alignment between languages, can instead produce more targeted candidate lists. In this paper we propose a hybrid approach that combines the strengths of these different sources of information using a machine learning algorithm to produce more robust and precise results. Automatic evaluation on gold standards shows that our hybrid method outperforms the individual statistical and alignment-based MWE extraction approaches for both Portuguese and English. This method can aid lexicographic work by providing a more targeted MWE candidate list.
