Alignment-based extraction of multiword expressions

Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. A lot of effort has been directed to the task of automatically identifying them, with considerable success. In this paper, we propose an approach for the identification of MWEs in a multilingual context, as a by-product of a word alignment process, that not only deals with the identification of possible MWE candidates, but also associates some multiword expressions with semantics. The results obtained indicate the feasibility and low costs in terms of tools and resources demanded by this approach, which could, for example, facilitate and speed up lexicographic work.

[1]  Maria das Graças Volpe Nunes,et al.  Evaluation of Methods for Sentence and Lexical Alignment of Brazilian Portuguese and English Parallel Texts , 2004, SBIA.

[2]  Paul Rayson,et al.  Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool , 2006 .

[3]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[4]  Timothy Baldwin,et al.  Extracting the Unextractable: A Case Study on Verb-particles , 2002, CoNLL.

[5]  Hermann Ney,et al.  Improved Statistical Alignment Models , 2000, ACL.

[6]  Darren Pearce A Comparative Evaluation of Collocation Extraction Techniques , 2002, LREC.

[7]  Jörg Tiedemann,et al.  Identifying idiomatic expressions using automatic word-alignment , 2006 .

[8]  Knut Hofland A Program for Aligning English and Norwegian Sentences , 1995 .

[9]  Suzanne Stevenson,et al.  Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures , 2007 .

[10]  Timothy Baldwin,et al.  Multiword Expressions: A Pain in the Neck for NLP , 2002, CICLing.

[11]  Carlos Ramisch,et al.  Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering , 2007, EMNLP.

[12]  Aline Villavicencio,et al.  The availability of verb-particle constructions in lexical resources: How much is enough? , 2005, Comput. Speech Lang..

[13]  Stefan Evert,et al.  Using small random samples for the manual evaluation of statistical association measures , 2005, Comput. Speech Lang..

[14]  Aline Villavicencio,et al.  Automated Multiword Expression Prediction for Grammar Engineering , 2006 .

[15]  Mikel L. Forcada,et al.  Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation , 2007, Machine Translation.

[16]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[17]  Ray Jackendoff TWISTIN' THE NIGHT AWAY , 1997 .

[18]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[19]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[20]  I. Dan Melamed Automatic Discovery of Non-Compositional Compounds in Parallel Data , 1997, EMNLP.

[21]  Ted Briscoe,et al.  Robust Accurate Statistical Annotation of General Text , 2002, LREC.

[22]  Tim van de Cruys,et al.  Semantics-based Multiword Expression Extraction , 2007 .

[23]  Hermann Ney,et al.  A Comparison of Alignment Models for Statistical Machine Translation , 2000, COLING.

[24]  Mikel L. Forcada,et al.  Open-Source Portuguese-Spanish Machine Translation , 2006, PROPOR.

[25]  Paul Procter,et al.  Cambridge international dictionary of English , 2000 .

[26]  Roger K. Moore Computer Speech and Language , 1986 .