How Many Topics? Stability Analysis for Topic Models

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
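To make the idea of term-centric stability concrete, the following is a minimal sketch rather than the procedure used in the paper. It assumes each topic is represented by its list of top terms, that two topics are compared with Jaccard similarity over those terms, and that topics from two runs are paired by an exhaustive search over one-to-one matchings (feasible only for small numbers of topics). The function names `jaccard`, `agreement`, and `stability` are hypothetical.

```python
from itertools import permutations


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two top-term lists, treated as sets."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_x, topics_y):
    """Mean term similarity under the best one-to-one matching of topics.

    Assumption: both models have the same number of topics k, and k is
    small enough that trying every permutation is affordable.
    """
    k = len(topics_x)
    if k != len(topics_y):
        raise ValueError("both topic sets must contain the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_x[i], topics_y[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(reference_topics, perturbed_runs):
    """Average agreement between a reference model and models built on perturbed data."""
    return sum(agreement(reference_topics, run) for run in perturbed_runs) / len(perturbed_runs)


# Toy example with k = 2 topics and three top terms per topic (made-up terms).
reference = [["goal", "match", "team"], ["vote", "party", "election"]]
run_a = [["party", "vote", "election"], ["team", "goal", "league"]]
run_b = [["goal", "match", "team"], ["vote", "party", "election"]]

print(agreement(reference, run_a))
print(stability(reference, [run_a, run_b]))
```

Under these assumptions, one would compute such a score for each candidate number of topics, using models fitted to perturbed versions of the corpus, and favor the values where the score is highest.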
