Improving Word Translation Disambiguation by Capturing Multiword Expressions with Dictionaries

The paper describes a method for identifying and translating multiword expressions using a bi-directional dictionary. A dictionary-based approach suffers from limited recall, but its precision is high; hence it is best employed alongside an approach with complementary properties, such as an n-gram language model. We evaluate the method on data from the English-German translation part of the cross-lingual word sense disambiguation task in the 2010 semantic evaluation exercise (SemEval). The output of a baseline disambiguation system based on n-grams was substantially improved by matching the target words and their immediate contexts against compound and collocational words in a dictionary.
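
The matching step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionary entries, the function names (match_mwe, translate_target), the window size, and the stand-in baseline are all assumptions made for illustration, and only the English-to-German direction is shown. The idea is to look for the longest dictionary entry that spans the target word and to fall back to the baseline (for example, an n-gram model) when nothing matches.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy English-German entries; a real dictionary would be far larger.
MWE_DICT: Dict[Tuple[str, ...], str] = {
    ("river", "bank"): "Flussufer",
    ("bank", "account"): "Bankkonto",
    ("central", "bank"): "Zentralbank",
}


def match_mwe(
    tokens: List[str],
    target_idx: int,
    mwe_dict: Dict[Tuple[str, ...], str],
    max_len: int = 3,
) -> Optional[str]:
    """Return the translation of the longest entry that spans the target word."""
    lowered = [t.lower() for t in tokens]
    # Try longer expressions first; an expression has at least two tokens.
    for length in range(max_len, 1, -1):
        first = max(0, target_idx - length + 1)
        last = min(target_idx, len(lowered) - length)
        for start in range(first, last + 1):
            candidate = tuple(lowered[start:start + length])
            if candidate in mwe_dict:
                return mwe_dict[candidate]
    return None


def translate_target(
    tokens: List[str],
    target_idx: int,
    baseline: Callable[[List[str], int], str],
    mwe_dict: Dict[Tuple[str, ...], str] = MWE_DICT,
) -> str:
    """Prefer a dictionary hit (high precision); otherwise use the baseline."""
    hit = match_mwe(tokens, target_idx, mwe_dict)
    return hit if hit is not None else baseline(tokens, target_idx)


def toy_baseline(tokens: List[str], target_idx: int) -> str:
    # Stand-in for an n-gram based disambiguation system (assumption).
    return "Bank"


with_mwe = translate_target("She opened a bank account yesterday".split(), 3, toy_baseline)
without_mwe = translate_target("He sat on the bank".split(), 4, toy_baseline)
```

In this sketch the first sentence is resolved by the dictionary entry for "bank account", while the second has no matching entry and is passed to the baseline, which mirrors the high-precision, low-recall trade-off noted in the abstract.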
