Reproducible Extraction of Cross-lingual Topics (rectr)

ABSTRACT With global media content databases and online content now widely available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method uses open-source aligned word embeddings to capture the cross-lingual meanings of words and has a mechanism to normalize the residual influence of language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with those extracted using methods from the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We also demonstrate how our method can be applied to track news topics across time and languages.
