Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation

The lack of sufficient linguistic resources and parallel corpora for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bilingual and multilingual text resources are much more widely available than parallel translation data. This position paper presents previous research in this field and the research plans of the ACCURAT project. The project's goal is to find, analyze, and evaluate novel methods that exploit comparable corpora to compensate for the shortage of linguistic resources, and ultimately to significantly improve MT quality for under-resourced languages and narrow domains.
