Paper Information - Design and Analysis of a Weight-LDA Model to Extract Implicit Topic of Database in Social Networks

Design and Analysis of a Weight-LDA Model to Extract Implicit Topic of Database in Social Networks

In the era of big data, the volume of data in social networks is growing ever more rapidly. A social network is a theoretical construct used in the social sciences to study relationships and interactions among individuals, groups, and organizations. Massive data processing is essential for providing social network services. In this paper, we focus on extracting implicit aspects and opinion words in social networks. The Latent Dirichlet Allocation (LDA) model is a generative probabilistic model that automatically extracts implicit topics from a document set and has been widely used in natural language processing, text mining, and text categorization. However, the large number of non-taxonomy high-frequency content words in Chinese patent documents affects implicit topic generation and, in turn, Chinese patent classification. The study finds that the probability distribution of words in the expert database affects the extraction of feature words from patent documents. This paper proposes a weight-LDA model to address this problem of the LDA topic model in Chinese patent classification. The weight-LDA model, which combines the probability distribution of feature words in the expert database with Gibbs sampling, reduces the influence of non-taxonomy high-frequency content words on the topic distribution and strengthens the influence of low-frequency content words that have strong classification power. Six different types of patent data sets extracted from the State Intellectual Property Office of the P.R.C. are tested. The average F value of the weight-LDA model is 6% higher than that of the traditional LDA model. In addition, the weight-LDA model is compared with word-frequency-based feature selection methods such as the TF-IDF algorithm, and its average F value is 11.4% higher than that of the TF-IDF algorithm. Analysis of the experimental results shows that the weight-LDA model achieves better classification performance on Chinese patent documents.
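
The abstract does not state how the expert-database word probabilities enter the Gibbs sampler, so the following is only a minimal sketch of one common way to weight LDA, not the authors' method. It assumes each token of word `w` contributes a per-word weight `weights[w]` to the count tables of a collapsed Gibbs sampler instead of the usual 1. The function name `weighted_lda_gibbs`, the `weights` argument, and the hyperparameter defaults are all hypothetical.

```python
import random


def weighted_lda_gibbs(docs, vocab_size, num_topics, weights,
                       alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA with per-word weights.

    docs    : list of documents, each a list of integer word ids
    weights : weights[w] is the (assumed) importance of word w; a token of
              word w adds weights[w] to the counts instead of 1
    Returns (assignments, doc_topic, topic_word).
    """
    rng = random.Random(seed)
    doc_topic = [[0.0] * num_topics for _ in docs]
    topic_word = [[0.0] * vocab_size for _ in range(num_topics)]
    topic_total = [0.0] * num_topics
    assignments = []

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += weights[w]
            topic_word[k][w] += weights[w]
            topic_total[k] += weights[w]
        assignments.append(zs)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token's weighted contribution.
                k = assignments[d][i]
                doc_topic[d][k] -= weights[w]
                topic_word[k][w] -= weights[w]
                topic_total[k] -= weights[w]

                # Conditional topic distribution from the weighted counts.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=probs)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += weights[w]
                topic_word[k][w] += weights[w]
                topic_total[k] += weights[w]

    return assignments, doc_topic, topic_word
```

Under this assumption, giving a high-frequency but uninformative word a weight below 1 shrinks its pull on the topic counts, while a rare but discriminative word can be given a weight above 1. How such weights would be derived from an expert database is left open here.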

Li Huang | Cong Zhang | Neal N. Xiong | Shenghua Xu | Guoxiong Hu