News Topic Discovery Through Community Detection

With the rapid growth of communication technology and the internet, an enormous number of news items are published every day. News dissemination tends to concentrate coverage: many items address a single topic centered on the same event or person. News topic discovery has therefore become an important and pressing task in text mining. The most frequently used model for this task is Latent Dirichlet Allocation (LDA), which treats each document as generated from a finite mixture of $K$ possible topics. In practical applications, however, the performance of LDA is often unsatisfactory. In this paper, we address this problem through text structure mining. Our proposed method consists of two steps: first, community detection is used to discover topics as clusters, or communities, of news items; second, a Bayesian unigram model is used to obtain the topic tokens for each topic. Experimental results on a real-world news dataset show that the proposed method discovers topics much better than LDA.
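
The two-step pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice in it is an assumption: the abstract does not say how the news graph is built, which community detection algorithm is used, or how the Bayesian unigram model is parameterized. Here documents are linked by thresholded TF-IDF cosine similarity, communities are found by a naive greedy modularity merge, and topic tokens are ranked by the posterior P(topic | word) under Laplace-smoothed unigram likelihoods. The function names, the `threshold`, `k`, and `alpha` parameters, and the toy documents are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (word -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    return [
        {w: c * (math.log((1 + n) / (1 + df[w])) + 1.0)
         for w, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def similarity_graph(docs, threshold=0.1):
    """Assumed graph construction: weighted edge when cosine >= threshold."""
    vecs = tfidf_vectors(docs)
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        s = cosine(vecs[i], vecs[j])
        if s >= threshold:
            edges[(i, j)] = s
    return edges


def greedy_modularity(n, edges):
    """Step 1 (assumed algorithm): merge the pair of connected communities
    with the largest positive modularity gain until no merge helps."""
    comms = {i: {i} for i in range(n)}
    two_m = 2.0 * sum(edges.values())
    if two_m == 0:
        return list(comms.values())
    degree = [0.0] * n
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    while True:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(sorted(comms), 2):
            w_ab = sum(w for (u, v), w in edges.items()
                       if (u in comms[a] and v in comms[b])
                       or (u in comms[b] and v in comms[a]))
            if w_ab == 0:
                continue  # unconnected communities are never merged
            frac_a = sum(degree[u] for u in comms[a]) / two_m
            frac_b = sum(degree[u] for u in comms[b]) / two_m
            gain = 2.0 * (w_ab / two_m - frac_a * frac_b)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return list(comms.values())
        a, b = best_pair
        comms[a] |= comms.pop(b)


def topic_tokens(docs, communities, k=3, alpha=1.0):
    """Step 2 (assumed model): rank each topic's words by P(topic | word)
    computed from Laplace-smoothed unigram likelihoods and topic priors."""
    counts = [Counter(w for i in c for w in docs[i].lower().split())
              for c in communities]
    vocab_size = len(set().union(*counts))
    topics = range(len(communities))

    def joint(word, t):
        likelihood = ((counts[t][word] + alpha)
                      / (sum(counts[t].values()) + alpha * vocab_size))
        return likelihood * len(communities[t]) / len(docs)

    ranked = []
    for t in topics:
        post = {w: joint(w, t) / sum(joint(w, s) for s in topics)
                for w in counts[t]}
        ranked.append(sorted(post, key=lambda w: (-post[w], w))[:k])
    return ranked


# Toy corpus (hypothetical): two events, three news items each.
docs = [
    "earthquake hits city rescue teams respond",
    "rescue teams search city after earthquake",
    "earthquake damage city rescue continues",
    "team wins football final championship",
    "football championship final draws fans",
    "fans celebrate football championship win",
]
communities = greedy_modularity(len(docs), similarity_graph(docs))
for members, tokens in zip(communities, topic_tokens(docs, communities)):
    print(sorted(members), tokens)
```

Unlike LDA, this kind of pipeline does not require the number of topics $K$ to be fixed in advance: the number of communities falls out of the graph structure. The naive merge loop above rescans all edges on every pass, so it is only suitable for small illustrations; a production system would use an optimized community detection library.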
