Geometric Dirichlet Means Algorithm for topic inference

We propose a geometric algorithm for topic learning and inference that is built on the convex geometry of topics arising from the Latent Dirichlet Allocation (LDA) model and its nonparametric extensions. To this end, we study the optimization of a geometric loss function that serves as a surrogate for the LDA likelihood. Our method combines a fast, optimization-based weighted clustering procedure with geometric corrections. It overcomes the computational and statistical inefficiencies of techniques based on Gibbs sampling and variational inference, while achieving accuracy comparable to that of a Gibbs sampler. The topic estimates produced by our method are shown to be statistically consistent under some conditions. The algorithm is evaluated with extensive experiments on simulated and real data.
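To make the idea of "weighted clustering augmented with geometric corrections" concrete, the following is a minimal sketch under stated assumptions, not the algorithm as specified by the authors. It assumes documents are represented as normalized word-frequency vectors on the vocabulary simplex, that clustering weights are document lengths, and that the correction pushes each cluster centroid away from the corpus center by a fixed factor. The function names (`weighted_kmeans`, `extend_from_center`, `sketch_topics`) and the `scale` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Plain Lloyd iterations in which each centroid is a weighted mean."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # keep the old centroid if a cluster is empty
                centroids[j] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
    return centroids, labels


def extend_from_center(centroids, center, scale):
    """Assumed correction: move centroids outward along rays from the center."""
    extended = center + scale * (centroids - center)
    # Project back to valid probability vectors (clip, then renormalize).
    extended = np.clip(extended, 0.0, None)
    return extended / extended.sum(axis=1, keepdims=True)


def sketch_topics(counts, k, scale=1.5):
    """Estimate k topic vectors from a document-by-word count matrix."""
    counts = np.asarray(counts, dtype=float)
    doc_lengths = counts.sum(axis=1)
    freqs = counts / doc_lengths[:, None]  # documents as points on the simplex
    center = np.average(freqs, axis=0, weights=doc_lengths)
    centroids, labels = weighted_kmeans(freqs, doc_lengths, k)
    return extend_from_center(centroids, center, scale), labels
```

Calling `sketch_topics(counts, k=2)` on a small count matrix returns a `k`-by-vocabulary array whose rows are probability vectors, together with a cluster label per document. The fixed `scale` is the crudest possible correction; it stands in for whatever data-dependent correction the method actually uses.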
