A Hybrid Approach for Automatic Extraction of Bilingual Multiword Expressions from Parallel Corpora

Domain-specific bilingual lexicons play an important role in domain adaptation for machine translation. The entries of these lexicons are mostly composed of MultiWord Expressions (MWEs). The manual construction of bilingual MWE lexicons is costly and time-consuming, so word alignment approaches are often used to construct them automatically from parallel corpora. In this paper, we present a hybrid approach that extracts and aligns MWEs from parallel corpora in a one-step process. We formalize the alignment process as an integer linear programming problem in order to find an approximate optimal solution. This process generates lists of MWEs with their translations, which are then filtered using linguistic patterns to construct the bilingual MWE lexicons. We evaluate the bilingual MWE lexicons produced by this approach using two methods: a manual evaluation of the alignment quality, and an evaluation of the impact of this alignment on the translation quality of the phrase-based statistical machine translation system Moses. We show experimentally that integrating the bilingual MWEs and their linguistic information into the translation model improves the performance of Moses.
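To make the idea of alignment as an integer linear program concrete, the sketch below solves a toy 0-1 program: maximize the total association score of the selected source-target links, subject to each candidate MWE being linked at most once. This is an illustration under stated assumptions, not the formulation used in the paper. The function name `align_mwes`, the score matrix, and the one-to-one constraint are all hypothetical, and the program is solved by exhaustive search, which is only feasible for very small inputs.

```python
from itertools import product


def align_mwes(scores):
    """Solve a tiny 0-1 alignment program by exhaustive search.

    scores[i][j] is a hypothetical association score between source
    candidate i and target candidate j (an assumption for illustration).
    Returns (best total score, list of (source, target) links).
    """
    n_src = len(scores)
    n_tgt = len(scores[0]) if scores else 0
    best_value, best_links = 0.0, []
    # Each source candidate picks one target index, or None (left unaligned).
    choices = [None] + list(range(n_tgt))
    for assignment in product(choices, repeat=n_src):
        used = [j for j in assignment if j is not None]
        # Constraint: each target candidate is linked at most once.
        if len(used) != len(set(used)):
            continue
        links = [(i, j) for i, j in enumerate(assignment) if j is not None]
        # Objective: sum of the scores of the selected links.
        value = sum(scores[i][j] for i, j in links)
        if value > best_value:
            best_value, best_links = value, links
    return best_value, best_links


# Toy example with made-up scores for two candidate pairs.
source = ["machine translation", "word alignment"]
target = ["traduction automatique", "alignement de mots"]
scores = [[0.75, 0.25],
          [0.125, 0.5]]
value, links = align_mwes(scores)
for i, j in links:
    print(source[i], "<->", target[j])
```

For realistic corpus sizes, an ILP solver or an approximation would replace the exhaustive loop. The point here is only the shape of the objective and the constraints.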
