Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections

Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis, based predominantly on graphical models and Bayesian learning. Additive regularization of topic models (ARTM) is a recent semi-probabilistic approach that provides simpler inference for many models previously studied only in the Bayesian setting. ARTM lowers the barrier to entry into the topic modeling research field and makes it easier to combine topic models. In this paper we develop a multimodal extension of the ARTM approach and implement it in BigARTM, an open-source project for online parallelized topic modeling. We demonstrate the ability of non-Bayesian regularization to combine modalities, languages, and multiple criteria to find sparse, diverse, and interpretable topics.
