Paper Information - Short-Text Topic Modeling via Non-negative Matrix Factorization Enriched with Local Word-Context Correlations

Short texts are a prevalent form of social communication on the Internet, and billions of them are generated every day. Discovering knowledge from them has attracted considerable interest from both industry and academia. Because short texts carry limited contextual information and are sparse, noisy, and ambiguous, automatically learning topics from them remains an important challenge. To tackle this problem, we propose a semantics-assisted non-negative matrix factorization (SeaNMF) model to discover topics in short texts. It effectively incorporates word-context semantic correlations into the model, where the semantic relationships between words and their contexts are learned from the skip-gram view of the corpus. The SeaNMF model is solved using a block coordinate descent algorithm. We also develop a sparse variant of the SeaNMF model that can achieve better model interpretability. Extensive quantitative evaluations on various real-world short-text datasets demonstrate the superior performance of the proposed models over several other state-of-the-art methods in terms of topic coherence and classification accuracy. A qualitative semantic analysis demonstrates the interpretability of our models, which discover meaningful and consistent topics. With its simple formulation and superior performance, SeaNMF can be an effective standard topic model for short texts.
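The idea of factorizing a term-document matrix jointly with a word-context correlation matrix can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the abstract: the objective ||A - W H^T||^2 + alpha * ||S - W Wc^T||^2, the use of a shifted positive PMI matrix for S with each short document treated as one context window, the toy corpus, and all function and parameter names. The solver uses standard multiplicative updates for simplicity, not the block coordinate descent algorithm the paper describes.

```python
import numpy as np

def term_doc_matrix(docs, vocab_size):
    """Count matrix A (words x documents) from documents given as word-id lists."""
    A = np.zeros((vocab_size, len(docs)))
    for d, doc in enumerate(docs):
        for w in doc:
            A[w, d] += 1.0
    return A

def sppmi_matrix(docs, vocab_size, shift=1.0):
    """Shifted positive PMI over word pairs that co-occur in the same document.

    Assumption: the whole short document acts as the skip-gram context window.
    """
    counts = np.zeros((vocab_size, vocab_size))
    for doc in docs:
        for i, w in enumerate(doc):
            for j, c in enumerate(doc):
                if i != j:
                    counts[w, c] += 1.0
    total = counts.sum()
    word_freq = counts.sum(axis=1, keepdims=True)
    ctx_freq = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (word_freq * ctx_freq))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs contribute nothing
    return np.maximum(pmi - np.log(shift), 0.0)

def objective(A, S, W, Wc, H, alpha):
    """Assumed joint loss: document reconstruction plus word-context reconstruction."""
    doc_term = np.linalg.norm(A - W @ H.T) ** 2
    ctx_term = np.linalg.norm(S - W @ Wc.T) ** 2
    return doc_term + alpha * ctx_term

def fit(A, S, n_topics, alpha=1.0, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative updates (a stand-in for block coordinate descent)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], n_topics))   # word-topic factors
    Wc = rng.random((A.shape[0], n_topics))  # context-topic factors
    H = rng.random((A.shape[1], n_topics))   # document-topic factors
    history = [objective(A, S, W, Wc, H, alpha)]
    for _ in range(n_iter):
        W *= (A @ H + alpha * (S @ Wc)) / (W @ (H.T @ H + alpha * (Wc.T @ Wc)) + eps)
        Wc *= (S.T @ W) / (Wc @ (W.T @ W) + eps)
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)
        history.append(objective(A, S, W, Wc, H, alpha))
    return W, Wc, H, history

# Hypothetical toy corpus: two groups of short documents over six words.
vocab = ["nmf", "topic", "matrix", "goal", "match", "team"]
docs = [[0, 1, 2], [0, 2], [1, 2, 0], [3, 4, 5], [4, 5], [3, 5, 4]]

A = term_doc_matrix(docs, len(vocab))
S = sppmi_matrix(docs, len(vocab))
W, Wc, H, history = fit(A, S, n_topics=2)

# Print the highest-weighted words of each learned topic.
for k in range(W.shape[1]):
    top = np.argsort(W[:, k])[::-1][:3]
    print(f"topic {k}:", [vocab[i] for i in top])
```

In this sketch the weight alpha controls how strongly the word-context term influences the shared word factors W; setting alpha to 0 reduces the updates for W and H to plain NMF on the term-document matrix.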
