Mining Parallel Texts from Mixed-Language Web Pages

We propose mining parallel texts from mixed-language web pages. We define a mixed-language web page as a web page containing text in at least two languages. We mined Japanese-English parallel texts from mixed-language web pages, present statistics for the extracted parallel texts, and report machine translation experiments. These statistics and experiments show that mixed-language web pages are rich sources of parallel texts.
