Extraction of Multi-word Expressions from Small Parallel Corpora

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naive alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
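The candidate-proposal step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_len` parameter, the way the target side is collected, and the English-French example sentence are all assumptions made for illustration. The sketch treats every alignment link whose source and target tokens each take part in exactly one link as a 1:1 anchor, and proposes each maximal run of non-anchored source tokens between anchors as an MWE candidate, paired with the target tokens linked to that run. Ranking and filtering against a monolingual corpus is left out.

```python
from collections import Counter


def one_to_one_anchors(alignment):
    """Return the links (i, j) whose source index i and target index j
    each occur in exactly one alignment link."""
    src_counts = Counter(i for i, _ in alignment)
    tgt_counts = Counter(j for _, j in alignment)
    return {(i, j) for i, j in alignment
            if src_counts[i] == 1 and tgt_counts[j] == 1}


def mwe_candidates(source, target, alignment, min_len=2):
    """Propose (source phrase, target phrase) pairs from misaligned spans.

    Hypothetical helper: a candidate is a maximal run of at least `min_len`
    consecutive source tokens that are not 1:1 anchored; its translation is
    approximated by the target tokens linked to any token in the run.
    """
    anchored_src = {i for i, _ in one_to_one_anchors(alignment)}
    candidates = []
    run = []
    # Iterating one position past the end closes a run that reaches
    # the end of the sentence.
    for i in range(len(source) + 1):
        if i < len(source) and i not in anchored_src:
            run.append(i)
            continue
        if len(run) >= min_len:
            tgt_idx = sorted({j for s, j in alignment if s in run})
            candidates.append((
                " ".join(source[k] for k in run),
                " ".join(target[j] for j in tgt_idx),
            ))
        run = []
    return candidates


# Made-up example: "kicked the bucket" is linked many-to-one to "mort".
source = ["he", "kicked", "the", "bucket", "yesterday"]
target = ["il", "est", "mort", "hier"]
alignment = {(0, 0), (1, 2), (2, 2), (3, 2), (4, 3)}

print(mwe_candidates(source, target, alignment))
```

Here "he" and "yesterday" are 1:1 anchors, so the only misaligned span is the three tokens between them. In a fuller pipeline, each proposed pair would then be scored with statistics from a large monolingual corpus before being kept.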
