No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

Automated text analysis allows researchers to analyze large quantities of text. Comparative researchers, however, face a major challenge: people in different countries speak different languages. To address this issue, some analysts have suggested using Google Translate to convert all texts into English before starting the analysis (Lucas et al. 2015). But in doing so, do we get lost in translation? This paper evaluates the usefulness of machine translation for bag-of-words models, such as topic models. We use the Europarl dataset and compare term-document matrices (TDMs) as well as topic model results from gold-standard human-translated text and machine-translated text. We evaluate results at both the document and the corpus level. First, we find the TDMs of the two corpora to be highly similar, with only minor differences across languages. Moreover, we find considerable overlap in the sets of features generated from human-translated and machine-translated texts. With regard to LDA topic models, we find topical prevalence and topical content to be highly similar, again with only small differences across languages. We conclude that Google Translate is a useful tool for comparative researchers when using bag-of-words text models.
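The document-level and corpus-level comparisons described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the tokenizer, the choice of cosine similarity per document pair, the Jaccard index for feature overlap, and the two toy sentence pairs are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tdm_row(text):
    """One row of a term-document matrix, stored as a term -> count mapping."""
    return Counter(tokenize(text))


def cosine_similarity(row_a, row_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * row_b[term] for term, count in row_a.items())
    norm_a = math.sqrt(sum(c * c for c in row_a.values()))
    norm_b = math.sqrt(sum(c * c for c in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_overlap(rows_a, rows_b):
    """Jaccard overlap between the vocabularies of two corpora."""
    vocab_a = set().union(*rows_a)
    vocab_b = set().union(*rows_b)
    if not vocab_a and not vocab_b:
        return 1.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Hypothetical paired documents: the same source text translated by a
# human and by a machine (invented examples, not taken from Europarl).
human = [
    "The committee approved the budget.",
    "Parliament debated the fisheries policy.",
]
machine = [
    "The committee approved the budget.",
    "The parliament discussed the fishing policy.",
]

human_rows = [tdm_row(doc) for doc in human]
machine_rows = [tdm_row(doc) for doc in machine]

# Document level: similarity of each human/machine pair of TDM rows.
similarities = [cosine_similarity(h, m) for h, m in zip(human_rows, machine_rows)]

# Corpus level: how much of the feature set the two corpora share.
overlap = feature_overlap(human_rows, machine_rows)

print(similarities, overlap)
```

In this toy setup, a pair with identical wording scores a cosine similarity of 1, while synonym substitutions ("debated" versus "discussed") lower both the pairwise similarity and the vocabulary overlap, which is the kind of divergence the TDM comparison is meant to detect.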

[1] Rasoul Samad, et al. The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content, 2015.

[2] Ewan Klein, et al. Natural Language Processing with Python, 2009.

[3] Sharid Loáiciga, et al. English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling, 2014, LREC.

[4] Margaret E. Roberts, et al. Computer-Assisted Text Analysis for Comparative Politics, 2015, Political Analysis.

[5] Philipp Koehn, et al. Manual and Automatic Evaluation of Machine Translation between European Languages, 2006, WMT@HLT-NAACL.

[6] Margaret E. Roberts, et al. A Model of Text for Experimentation in the Social Sciences, 2016.

[7] R. Gray, et al. Language-tree divergence times support the Anatolian theory of Indo-European origin, 2003, Nature.

[8] Justin Grimmer, et al. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts, 2013, Political Analysis.

[9] Lucia Specia, et al. Document-level translation quality estimation: exploring discourse and pseudo-references, 2014, EAMT.

[10] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[11] Kenneth Benoit, et al. quanteda: Quantitative Analysis of Textual Data (R package), 2015.

[12] Owen Rambow, et al. Sentiment Analysis of Twitter Data, 2011.

[13] Andrei Popescu-Belis, et al. Discourse-level Annotation over Europarl for Machine Translation: Connectives and Pronouns, 2012, LREC.

[14] Arthur Spirling, et al. Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It, 2017, Political Analysis.

[15] Arthur Spirling, et al. Assessing the Consequences of Text Preprocessing Decisions, 2016.

[16] Stephen Hampshire, et al. Translation and the Internet: Evaluating the Quality of Free Online Machine Translators, 2010.

[17] Yoav Goldberg, et al. Automatic Detection of Machine Translated Text and Translation Quality Estimation, 2014, ACL.

[18] Alexandra Balahur, et al. Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis, 2014, Comput. Speech Lang.

[19] Kenneth Benoit, et al. Estimating Intra-Party Preferences: Comparing Speeches to Votes, 2015, Political Science Research and Methods.

[20] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[21] Jennifer Foster, et al. Syntax and Semantics in Quality Estimation of Machine Translation, 2014, SSST@EMNLP.

[22] Kurt Hornik, et al. topicmodels: An R Package for Fitting Topic Models, 2016.

[23] Haiyan Wang, et al. quanteda: An R package for the quantitative analysis of textual data, 2018, J. Open Source Softw.

[24] Jeffrey Heer, et al. Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment, 2013, ICML.

[25] Alta Van Rensburg, et al. Translation technology explored: Has a three-year maturation period done Google Translate any good?, 2013.

[26] Zoltán Fazekas, et al. The Nuts and Bolts of Automated Text Analysis. Comparing Different Document Pre-Processing Techniques in Four Countries, 2016.