Conic Scan-and-Cover algorithms for nonparametric topic modeling

We propose new algorithms for topic modeling when the number of topics is unknown. Our approach relies on an analysis of the concentration of mass and the angular geometry of the topic simplex, a convex polytope formed as the convex hull of vertices representing the latent topics. In practice, our algorithms estimate topics with accuracy comparable to that of a Gibbs sampler, even though the Gibbs sampler requires the number of topics to be given. Moreover, they are among the fastest when compared with several state-of-the-art parametric techniques. We establish statistical consistency of our estimator under some conditions.
