Latent Dirichlet allocation: stability and applications to studies of user-generated content

Topic modeling, and the Latent Dirichlet Allocation (LDA) model in particular, has recently emerged as an important tool for understanding large datasets, notably user-generated datasets in social studies of the Web. In this work, we investigate the instability of LDA inference and propose a new metric of similarity between topics and a criterion for vocabulary reduction. We show the limitations of the LDA approach for qualitative analysis in social science and sketch some ways to improve it.
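
To make the notion of instability concrete, the following is a minimal sketch of how topics from two LDA runs on the same corpus might be compared. It assumes each topic is given as a probability distribution over a shared vocabulary. The similarity measure used here, Jensen-Shannon divergence, is a stand-in chosen for illustration and is not the metric proposed in this work; the function names and the toy data are likewise hypothetical.

```python
import math
from itertools import permutations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Stand-in similarity measure, assumed for illustration only.
    """
    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def match_topics(run_a, run_b):
    """Pair each topic of run_a with a distinct topic of run_b.

    Returns the pairing with the lowest total divergence and the mean
    divergence over matched pairs. Brute force over permutations, so it
    is only practical for a handful of topics.
    """
    k = len(run_a)
    cost = [[js_divergence(a, b) for b in run_b] for a in run_a]
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(k)),
    )
    mean_cost = sum(cost[i][best[i]] for i in range(k)) / k
    return list(best), mean_cost


# Toy topic-word distributions from two hypothetical runs (4-word vocabulary).
run_a = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
run_b = [[0.1, 0.1, 0.2, 0.6], [0.6, 0.3, 0.1, 0.0]]

pairing, mean_jsd = match_topics(run_a, run_b)
print(pairing, mean_jsd)
```

A low mean divergence over matched pairs suggests the two runs recovered similar topics; a high value suggests instability. For a realistic number of topics, the brute-force search would be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment`.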
