Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm

In topic modeling, many algorithms that guarantee identifiability of the topics have been developed under the premise that there exist anchor words, i.e., words that appear (with positive probability) in only one topic. Follow-up work has resorted to third- or higher-order statistics of the data corpus to relax the anchor-word assumption. Reliable estimates of higher-order statistics are hard to obtain, however, and the identification of topics under those models hinges on the topics being uncorrelated, which can be unrealistic. This paper revisits topic modeling based on second-order moments and proposes an anchor-free topic mining framework. The proposed approach guarantees identification of the topics under a much milder condition than the anchor-word assumption, and is therefore more robust in practice. The associated algorithm involves only one eigen-decomposition and a few small linear programs, which makes it easy to implement and to scale up to very large problem instances. Experiments on the TDT2 and Reuters-21578 corpora demonstrate that the proposed anchor-free approach performs very favorably compared to the prior art, as measured by coherence, similarity count, and clustering accuracy.
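To make the terminology concrete, the following is a minimal NumPy sketch of two notions the abstract relies on: an anchor word, and the second-order (word-word co-occurrence) moment whose eigen-decomposition the algorithm starts from. It is an illustration under stated assumptions, not the paper's algorithm: the factorization P = C E Cᵀ (with C a word-topic matrix and E a topic-topic correlation matrix), the helper names `anchor_words`, `second_order_moment`, and `top_eigenspace`, and the synthetic data are all assumptions introduced here, and the linear-programming stage is omitted entirely.

```python
import numpy as np


def anchor_words(C, tol=1e-12):
    """Map each topic to the words that have positive probability only in that topic."""
    anchors = {}
    for v, row in enumerate(C):
        support = np.flatnonzero(row > tol)
        if support.size == 1:
            anchors.setdefault(int(support[0]), []).append(v)
    return anchors


def second_order_moment(C, E):
    """Assumed model of the word-word co-occurrence matrix: P = C E C^T."""
    return C @ E @ C.T


def top_eigenspace(P, F):
    """Return the F largest eigenvalues of symmetric P and their eigenvectors."""
    w, U = np.linalg.eigh(P)           # eigh returns eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:F]      # indices of the F largest eigenvalues
    return w[idx], U[:, idx]


# Synthetic example (hypothetical sizes): V words, F topics.
rng = np.random.default_rng(0)
V, F = 8, 3
C = rng.random((V, F))
C[0] = [0.5, 0.0, 0.0]                 # make word 0 an anchor word for topic 0
C /= C.sum(axis=0, keepdims=True)      # each column is a distribution over words

A = rng.random((F, F))
E = A @ A.T + 0.1 * np.eye(F)          # symmetric, non-diagonal: correlated topics
E /= E.sum()

P = second_order_moment(C, E)
w, U = top_eigenspace(P, F)

# The top-F eigenvectors of P span the same subspace as the columns of C,
# so projecting C onto that subspace should leave essentially no residual.
residual = np.linalg.norm(C - U @ (U.T @ C))
```

In this toy setup only topic 0 has an anchor word, so a method that requires one anchor per topic would not apply, while the subspace recovered from a single eigen-decomposition still contains the columns of C. Recovering C itself from that subspace is the part this sketch leaves out.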
