Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy

Topic modeling is a popular technique for clustering large collections of text documents, and topic models implement a variety of regularization types. In this paper, we propose a novel approach for analyzing how different types of regularization influence the results of topic modeling. The approach is based on Renyi entropy and is inspired by concepts from statistical physics, in which the inferred topical structure of a collection can be treated as an information statistical system residing in a non-equilibrium state. We test the approach on four models: Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA). First, we show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach to topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal number of topics, an effect that is not observed for the hyper-parameters of LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, and that these distortions need further research.
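The abstract does not state the entropy formula, so the following is only a minimal sketch of the selection idea it describes: compute an entropy value for each candidate number of topics and take the number at which the entropy is lowest. It uses the textbook Renyi entropy of order alpha for a discrete distribution, H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), which is an assumption and may differ from the exact quantity used in the paper. The names `renyi_entropy` and `argmin_entropy` are hypothetical helpers introduced for illustration.

```python
import math


def renyi_entropy(probs, alpha):
    """Textbook Renyi entropy of order alpha (natural log) for a discrete distribution.

    Assumption: this is the generic definition, not necessarily the
    entropy functional applied to topic models in the paper.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-9, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    power_sum = sum(p ** alpha for p in probs if p > 0)
    return math.log(power_sum) / (1.0 - alpha)


def argmin_entropy(entropy_by_topic_count):
    """Return the topic count whose entropy value is smallest.

    `entropy_by_topic_count` maps a candidate number of topics to an
    entropy value computed from a model trained with that many topics.
    """
    return min(entropy_by_topic_count, key=entropy_by_topic_count.get)
```

In use, one would train a model for each candidate number of topics, compute an entropy value from each trained model, collect the values in a dictionary keyed by topic count, and pass that dictionary to `argmin_entropy`. How the entropy is derived from a trained model is left open here, because the abstract does not specify it.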
