Paper information - A Text Mining Research Based on LDA Topic Modelling

A Text Mining Research Based on LDA Topic Modelling

A large amount of digital text is generated every day, and searching, managing and exploring this text data effectively has become a central task. In this paper, we first present an introduction to text mining and to Latent Dirichlet Allocation (LDA), a probabilistic topic model. We then describe two experiments: topic modelling of Wikipedia articles and topic modelling of users’ tweets. The former builds a document topic model, aiming at a topic-based approach to searching, exploring and recommending articles. The latter builds a user topic model, providing a thorough analysis of Twitter users’ interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
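To make the pipeline named in the abstract (pre-processing followed by LDA model training) more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy corpus, the stop-word list, the hyperparameters `alpha` and `beta`, and the choice of collapsed Gibbs sampling as the inference method are all assumptions made for illustration.

```python
import random
import re

# Hypothetical toy corpus and stop-word list, invented for illustration only.
raw_texts = [
    "The football team won the match after a late goal",
    "The striker scored a goal and the team celebrated",
    "The stock market fell as investors sold shares",
    "Investors bought shares when the market recovered",
]
STOP_WORDS = {"the", "a", "an", "and", "as", "after", "when"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the count tables and vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total tokens per topic
    assignments = []                                            # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, topic, n=3):
    """Return the n words with the highest count in the given topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[topic][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


NUM_TOPICS = 2
ALPHA = 0.1
docs = [tokenize(text) for text in raw_texts]
doc_topic, topic_word, vocab = lda_gibbs(docs, NUM_TOPICS, alpha=ALPHA)

# Smoothed per-document topic proportions (each row sums to 1).
theta = [
    [(count + ALPHA) / (len(doc) + NUM_TOPICS * ALPHA) for count in row]
    for row, doc in zip(doc_topic, docs)
]

for k in range(NUM_TOPICS):
    print("topic", k, top_words(topic_word, vocab, k))
for d, row in enumerate(theta):
    print("document", d, [round(p, 2) for p in row])
```

The per-document proportions in `theta` are the kind of quantity a document or user topic model would compare when searching for or recommending similar items. On a real corpus, an established library would normally be preferable to a hand-written sampler like this one.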

Haiyi Zhang | Zhou Tong
