Sparse online topic models

Topic models have shown great promise in discovering latent semantic structures from complex data corpora, ranging from text documents and web news articles to images, videos, and even biological data. In order to deal with massive data collections and dynamic text streams, probabilistic online topic models such as online latent Dirichlet allocation (OLDA) have recently been developed. However, due to normalization constraints, OLDA can be ineffective in controlling the sparsity of discovered representations, a desirable property for learning interpretable semantic patterns, especially when the total number of topics is large. In contrast, sparse topical coding (STC) has been successfully introduced as a non-probabilistic topic model for effectively discovering sparse latent patterns by using sparsity-inducing regularization. But, unfortunately STC cannot scale to very large datasets or deal with online text streams, partly due to its batch learning procedure. In this paper, we present a sparse online topic model, which directly controls the sparsity of latent semantic patterns by imposing sparsity-inducing regularization and learns the topical dictionary by an online algorithm. The online algorithm is efficient and guaranteed to converge. Extensive empirical results of the sparse online topic model as well as its collapsed and supervised extensions on a large-scale Wikipedia dataset and the medium-sized 20Newsgroups dataset demonstrate appealing performance.

[1]  John D. Lafferty,et al.  Correlated Topic Models , 2005, NIPS.

[2]  Alexei A. Efros,et al.  Using Multiple Segmentations to Discover Objects and their Extent in Image Collections , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[4]  David M. Blei,et al.  Supervised Topic Models , 2007, NIPS.

[5]  Eric P. Xing,et al.  MedLDA: maximum margin supervised topic models for regression and classification , 2009, ICML '09.

[6]  Yasushi Sakurai,et al.  Online multiscale dynamic topic models , 2010, KDD.

[7]  Andrew McCallum,et al.  Polylingual Topic Models , 2009, EMNLP.

[8]  Michael I. Jordan,et al.  Learning Multiscale Representations of Natural Scenes Using Dirichlet Processes , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Bhiksha Raj,et al.  Sparse Overcomplete Latent Variable Decomposition of Counts Data , 2007, NIPS.

[10]  Pietro Perona,et al.  Learning object categories from Google's image search , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[11]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[12]  B. Schölkopf,et al.  Non-monotonic Poisson Likelihood Maximization , 2008 .

[13]  Jordan L. Boyd-Graber,et al.  Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce , 2012, WWW.

[14]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  Aapo Hyvärinen,et al.  Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation , 1999, Neural Computation.

[16]  Andrew McCallum,et al.  Efficient methods for topic model inference on streaming document collections , 2009, KDD.

[17]  Patrik O. Hoyer,et al.  Non-negative sparse coding , 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[18]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[19]  David M. Blei,et al.  Sparse stochastic inference for latent Dirichlet allocation , 2012, ICML.

[20]  Guillermo Sapiro,et al.  Supervised Dictionary Learning , 2008, NIPS.

[21]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[22]  Eric P. Xing,et al.  Sparse Topical Coding , 2011, UAI.

[23]  Chong Wang,et al.  Simultaneous image classification and annotation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[25]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[26]  Wen Gao,et al.  Towards compact topical descriptors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[28]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[29]  Xiaojin Zhu,et al.  A Topic Model for Word Sense Disambiguation , 2007, EMNLP.

[30]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[31]  Quan Wang,et al.  Regularized latent semantic indexing , 2011, SIGIR.

[32]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[33]  Thomas Hofmann,et al.  A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation , 2007 .

[34]  Francis R. Bach,et al.  Online Learning for Latent Dirichlet Allocation , 2010, NIPS.

[35]  Edoardo M. Airoldi,et al.  Mixed Membership Stochastic Blockmodels , 2007, NIPS.

[36]  N. Stanietsky,et al.  The interaction of TIGIT with PVR and PVRL2 inhibits human NK cell cytotoxicity , 2009, Proceedings of the National Academy of Sciences.

[37]  Léon Bottou,et al.  On-line learning and stochastic approximations , 1999 .

[38]  Yee Whye Teh,et al.  On Smoothing and Inference for Topic Models , 2009, UAI.

[39]  Léon Bottou,et al.  The Tradeoffs of Large Scale Learning , 2007, NIPS.

[40]  Hanqing Lu,et al.  Learning Semantic Motion Patterns for Dynamic Scenes by Improved Sparse Topical Coding , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[41]  Hao Su,et al.  Multi-Level Structured Image Coding on High-Dimensional Image Representation , 2012, ACCV.

[42]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.