Stability of topic modeling via matrix factorization

The problem of the instability of standard topic modeling algorithms is investigated.Three new stability measures for topic models are proposed.Two new ensemble approaches for topic modeling with matrix factorization are proposed.A detailed evaluation of these approaches is performed on 10 text corpora. Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of instability which has previously been studied in the context of k-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.

[1]  Ting Su,et al.  A deterministic method for initializing K-means clustering , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[2]  Hassan A. Kingravi,et al.  Deterministic Initialization of the k-Means Algorithm using Hierarchical Clustering , 2012, Int. J. Pattern Recognit. Artif. Intell..

[3]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[4]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[5]  Jaegul Choo,et al.  L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization , 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).

[6]  Ludmila I. Kuncheva,et al.  Using diversity in cluster ensembles , 2004, 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583).

[7]  Xin Yao,et al.  Diversity creation methods: a survey and categorisation , 2004, Inf. Fusion.

[8]  Jaegul Choo,et al.  Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering , 2014 .

[9]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[10]  A. Bertoni,et al.  Random projections for assessing gene expression cluster stability , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[11]  Ludmila I. Kuncheva,et al.  Moderate diversity for better cluster ensembles , 2006, Inf. Fusion.

[12]  Pedro Larrañaga,et al.  An empirical comparison of four initialization methods for the K-Means algorithm , 1999, Pattern Recognit. Lett..

[13]  Timothy Baldwin,et al.  Automatic Evaluation of Topic Coherence , 2010, NAACL.

[14]  Weizhong Zhao,et al.  A heuristic approach to determine an appropriate number of topics in topic modeling , 2015, BMC Bioinformatics.

[15]  Sanjeev Arora,et al.  Learning Topic Models -- Going beyond SVD , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[16]  Isabelle Guyon,et al.  A Stability Based Method for Discovering Structure in Clustered Data , 2001, Pacific Symposium on Biocomputing.

[17]  Derek Greene,et al.  Ensemble non-negative matrix factorization methods for clustering protein-protein interactions , 2008, Bioinform..

[18]  Thomas L. Griffiths,et al.  Probabilistic Topic Models , 2007 .

[19]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[20]  Gerlof Bouma,et al.  Normalized (pointwise) mutual information in collocation extraction , 2009 .

[21]  Stefan M. Wild,et al.  Improving non-negative matrix factorizations through structured initialization , 2004, Pattern Recognit..

[22]  Anil K. Jain,et al.  Clustering ensembles: models of consensus and weak partitions , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Ludmila I. Kuncheva,et al.  Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[25]  Kilian Q. Weinberger,et al.  Snapshot Ensembles: Train 1, get M for free , 2017, ICLR.

[26]  Michelangelo Ceci,et al.  ComiRNet: a web-based system for the analysis of miRNA-gene regulatory networks , 2015, BMC Bioinformatics.

[27]  Chih-Jen Lin,et al.  Projected Gradient Methods for Nonnegative Matrix Factorization , 2007, Neural Computation.

[28]  Joydeep Ghosh,et al.  Cluster Ensembles A Knowledge Reuse Framework for Combining Partitionings , 2002, AAAI/IAAI.

[29]  Paul S. Bradley,et al.  Refining Initial Points for K-Means Clustering , 1998, ICML.

[30]  Derek Greene,et al.  How Many Topics? Stability Analysis for Topic Models , 2014, ECML/PKDD.

[31]  Christos Boutsidis,et al.  SVD based initialization: A head start for nonnegative matrix factorization , 2008, Pattern Recognit..

[32]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[33]  H. Kuhn The Hungarian method for the assignment problem , 1955 .

[34]  Derek Greene,et al.  An analysis of the coherence of descriptors in topic modeling , 2015, Expert Syst. Appl..

[35]  Francisco Herrera,et al.  Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power , 2010, Inf. Sci..

[36]  William F. Punch,et al.  Ensembles of partitions via data resampling , 2004, International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004..