Mining Parallel Corpora from Sina Weibo and Twitter

Microblogs such as Twitter, Facebook, and Sina Weibo (China's equivalent of Twitter) are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” messages targeting audiences who speak different languages, either by writing the same message in multiple languages or by retweeting translations of their original posts in a second language. We introduce a method for finding and extracting this naturally occurring parallel data. Identifying the parallel content requires solving an alignment problem, and we give an optimally efficient dynamic programming algorithm for this. Using our method, we extract nearly 3M Chinese–English parallel segments from Sina Weibo using a targeted crawl of Weibo users who post in multiple languages. Additionally, from a random sample of Twitter, we obtain substantial amounts of parallel data in multiple language pairs. Evaluation is performed by assessing the accuracy of our extraction approach relative to a manual annotation as well as in terms of utility as training data for a Chinese–English machine translation system. Relative to traditional parallel data resources, the automatically extracted parallel data yield substantial translation quality improvements in translating microblog text and modest improvements in translating edited news content.
