Extracting translation pairs from social network content

We introduce two methods for collecting additional training data for statistical machine translation systems from public social network content. The first method identifies multilingual content in which the author translated their own post to reach additional friends, fans or customers. Once such a post is identified, we split it into its language segments and extract translation pairs from them. The second method considers web links (URLs) that users add to their posts to point the reader to a video, article or website. If the same URL is shared by users of different languages, there is a chance that they make the same comment, each in their own language. We use a support vector machine (SVM) as a classifier to separate true translations from all candidate pairs. We collected additional translation pairs using both methods for the language pairs Spanish-English and Portuguese-English. Using the collected data as additional training data for statistical machine translation systems resulted in improvements of up to 5 BLEU points on in-domain test sets.
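The candidate generation step of the second method can be illustrated with a minimal sketch. It assumes each post is already available as a `(url, lang, text)` tuple with a known language label; the function name `candidate_pairs` and this input layout are hypothetical and are not taken from the paper. The sketch only builds the candidate pairs; the classification step (the SVM described above) is not shown.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(posts, src_lang, tgt_lang):
    """Pair up posts in two languages that share the same URL.

    posts: iterable of (url, lang, text) tuples (assumed input format).
    Returns a list of (url, src_text, tgt_text) candidate pairs.
    """
    # Group post texts first by URL, then by language.
    by_url = defaultdict(lambda: defaultdict(list))
    for url, lang, text in posts:
        by_url[url][lang].append(text)

    pairs = []
    for url, by_lang in by_url.items():
        # Every source-language post is paired with every
        # target-language post that shares the same URL.
        src_texts = by_lang.get(src_lang, [])
        tgt_texts = by_lang.get(tgt_lang, [])
        for src, tgt in product(src_texts, tgt_texts):
            pairs.append((url, src, tgt))
    return pairs
```

Most pairs produced this way will not be translations of each other, which is why a classifier is needed to filter the candidates afterwards.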
