Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages

Multilingual topic models can reveal patterns in cross-lingual document collections. However, existing models lack speed and interactivity, which prevents their adoption in everyday corpus exploration or in fast-moving situations (e.g., natural disasters, political instability). First, we propose a multilingual anchoring algorithm that builds an anchor-based topic model for documents in different languages. Then, we incorporate interactivity to develop MTAnchor (Multilingual Topic Anchors), a system that allows users to refine the topic model. We test our algorithms on labeled English, Chinese, and Sinhalese documents. Within minutes, our methods can produce interpretable topics that are useful for specific classification tasks.
