Symmetric Correspondence Topic Models for Multilingual Text Analysis

Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as those in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), which extends CorrLDA with a hidden variable that controls the pivot language. In experiments on two multilingual comparable datasets extracted from Wikipedia, we demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models.
