Harvesting Parallel Text in Multiple Languages with Limited Supervision

The Web is an ever-increasing, dynamically changing, multilingual repository of text. Several approaches have been proposed for harvesting this repository to bootstrap, supplement, and adapt the data needed to train models in speech and language applications. In this paper, we present semi-supervised and unsupervised approaches to harvesting multilingual text that rely on a key observation of link collocation. We demonstrate the effectiveness of our approach in the context of statistical machine translation by harvesting parallel texts and training translation models in 20 different languages. Furthermore, by exploiting the DOM trees of parallel webpages, we extend our harvesting technique to create parallel data for resource-limited languages in an unsupervised manner. We also present some observations concerning the socio-economic factors that the multilingual Web reflects.
