A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability

Multilingual topic models (MTMs) learn topics from documents in multiple languages. Past models align topics across languages by implicitly assuming that the documents in different languages are highly comparable, an assumption that is often false. We introduce a new model that does not rely on this assumption, which makes it particularly useful in low-resource language scenarios. Our MTM learns weighted topic links and connects cross-lingual topics only when the dominant words defining them are similar. It outperforms LDA and previous MTMs in classification tasks that use documents’ topic posteriors as features, and it learns coherent topics on documents with low comparability.
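To make the idea of a weighted topic link concrete, the following is a minimal sketch, not the model described in this paper. It assumes each topic is a word-to-probability mapping and that a bilingual dictionary is available. The function names `top_words` and `link_weight`, the parameter `k`, and the overlap-based weight are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def top_words(topic: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable words of a topic (ties broken alphabetically)."""
    ranked = sorted(topic.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def link_weight(
    topic_a: Dict[str, float],
    topic_b: Dict[str, float],
    dictionary: Dict[str, Set[str]],
    k: int = 3,
) -> float:
    """Illustrative link weight in [0, 1] (an assumption, not the paper's formula).

    The weight is the fraction of topic_a's top-k words that have at least
    one dictionary translation among topic_b's top-k words.
    """
    top_a = top_words(topic_a, k)
    top_b = set(top_words(topic_b, k))
    if not top_a:
        return 0.0
    matched = sum(1 for word in top_a if dictionary.get(word, set()) & top_b)
    return matched / len(top_a)


# Toy data, invented for this example.
en_weather = {"water": 0.4, "flood": 0.3, "rain": 0.2, "car": 0.1}
es_weather = {"agua": 0.5, "lluvia": 0.3, "coche": 0.2}
es_sports = {"fútbol": 0.6, "gol": 0.4}
en_es = {
    "water": {"agua"},
    "rain": {"lluvia"},
    "flood": {"inundación"},
    "car": {"coche"},
}

strong = link_weight(en_weather, es_weather, en_es)  # dominant words overlap
weak = link_weight(en_weather, es_sports, en_es)     # no shared dominant words
```

In this toy setup, the two weather topics receive a high weight because their dominant words translate to each other, while the weather and sports topics receive a weight of zero and would stay unlinked.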
