How Many Topics? Stability Analysis for Topic Models

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
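To make the idea of term-centric stability concrete, the following is a minimal sketch and not the paper's actual measure, which the abstract does not specify. It assumes each topic is summarized by its list of top terms, compares two models by the mean Jaccard overlap of their topics under the best one-to-one matching, and averages that agreement over models fitted to perturbed versions of the data. The function names (`jaccard`, `agreement`, `stability`) and the toy term lists are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_a, topics_b):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Assumption: brute force over all permutations, so this is only
    practical for a small number of topics.
    """
    if len(topics_a) != len(topics_b) or not topics_a:
        raise ValueError("both models need the same non-zero number of topics")
    k = len(topics_a)
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(reference, perturbed_runs):
    """Average agreement between a reference model and models from perturbed data."""
    if not perturbed_runs:
        raise ValueError("need at least one perturbed run")
    return sum(agreement(reference, run) for run in perturbed_runs) / len(perturbed_runs)


# Toy top-term lists (hypothetical): one reference model and two perturbed runs.
reference = [["goal", "match", "team"], ["vote", "party", "election"]]
run_same = [["election", "vote", "party"], ["team", "goal", "match"]]
run_drift = [["vote", "party", "tax"], ["team", "goal", "league"]]

score = stability(reference, [run_same, run_drift])
```

Under these assumptions, one would compute such a score for each candidate number of topics and prefer values where the score is high, since a higher score means the top terms change less when the data is perturbed.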

Derek Greene | Padraig Cunningham | Derek O'Callaghan
