Data sparseness, the most evident characteristic of short text, is caused by the diversity of language expression and the short length of the text. Previous text models, represented by Bag of Words (BOW), consider only the statistical features of words and thus underperform on short texts. To tackle this problem, we introduce a new text model that combines the statistical method with semantic estimation. Specifically, we obtain a “Strong Feature Thesaurus” through a mining process with the Latent Dirichlet Allocation (LDA) model, and then incorporate the semantic information into BOW by weighting those strong feature terms. To assess the performance of this model, we conduct two clustering experiments on short text corpora. The results show that our model outperforms prevailing text models such as BOW.

Introduction

With the rapid development of network technology, more and more users want to share information that interests them online, typically through blogs, Twitter, and social networking services (SNS). Users can communicate more conveniently, exchange timely information, and express their opinions, which results in a large number of comments and opinions carrying personal emotion. These online messages, which are classified as short texts, share some common characteristics, namely short message length and intense user participation. Short texts touch on topics of all kinds and are of increasing informational importance. The modeling method lies at the core of every operation on short text, such as classification, similarity computation, and short text data mining. Short text modeling therefore has wide application in public opinion analysis, topic tracking, and consumer preference indication. Limited information content, a consequence of the short length, is characteristic of short text and makes the link between a text and its topic weak.
More importantly, because of the diversity of language, the same topic can be expressed in completely different ways, which reduces the chance that the same feature terms appear in different short texts. Therefore, modeling based on term co-occurrence often fails to achieve high accuracy, because short text data are sparse.

Intensive research has been conducted to solve the data sparseness problem and improve the modeling accuracy of short text. X. H. Phan et al. put forward a “BOW + hidden topics” modeling method for short text classification [1]. Another study introduced a “BOW + Wikipedia concepts” modeling method and applied it to short text clustering [2]. Other effective methods include the “simknow” modeling method of X. Hu et al., which clusters short texts using Wikipedia and world knowledge [3]. Building on the LDA model, the biterm topic model (BTM) was proposed for topic modeling of short texts [4]. Although these studies consider the semantic information hidden in feature words, they do not distinguish among those words. Yet different feature terms often contribute differently to identifying the topic. To further improve the accuracy of short text modeling, we must take into account the semantic importance of certain feature terms.

International Conference on Intelligent Systems Research and Mechatronics Engineering (ISRME 2015). © 2015. The authors. Published by Atlantis Press.

Inspired by the “structure + average” method in probabilistic graphical models [5], we propose a new model that combines both statistical and semantic information of feature terms. We discriminate among the feature terms by putting them into different groups according to their influence on the semantic information of the whole piece of text. First, we establish a “strong feature thesaurus” on the basis of the Latent Dirichlet Allocation (LDA) model.
Then, we put larger weight on feature terms that have significant semantic importance, which strengthens the discriminative power of the strong feature terms. Experimental results suggest that our model improves the purity of short text clustering.

The outline of the paper is as follows. Section II describes the general framework of our new method for short text modeling; Section III discusses the establishment of the “Strong Feature Thesaurus” as well as the procedure for weighting its terms; Section IV presents our main experimental process and the corresponding analysis of results; finally, Section V concludes the paper.

The General Framework

There are basically two approaches to text modeling. One is the traditional BOW, and the other expands BOW, as in “BOW + WordNet”. Both approaches are primarily based on the statistical information of feature terms and their literal meaning. However, the diversity of language expression makes it especially difficult to determine the semantic meaning of words in context. As a result, these methods share a common problem: their modeling accuracy is often limited. The sparseness of short text makes matters worse, so the accuracy of short text modeling is generally lower than that of ordinary texts. In our model, we incorporate domain knowledge, which is obtained through mining on large datasets. With the help of this domain knowledge, we treat the strong feature terms separately by giving them larger weight. The general framework is depicted in Fig. 1.
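
The weighting idea in the framework can be illustrated with a minimal sketch. It assumes that the strong feature terms are already available as a plain set and that the larger weight is applied as a multiplicative factor on raw term counts. The function name weighted_bow, the parameter strong_weight, its value, and the toy documents are hypothetical and are not taken from the paper, whose actual weighting procedure is the subject of Section III.

```python
from collections import Counter


def weighted_bow(tokens, vocabulary, strong_terms, strong_weight=2.0):
    """Build a BOW vector in which strong feature terms are up-weighted.

    Assumption: the weight is a simple multiplicative factor on the raw
    term count; this is one possible reading of "larger weight".
    """
    counts = Counter(tokens)
    return [
        counts[term] * (strong_weight if term in strong_terms else 1.0)
        for term in vocabulary
    ]


# Hypothetical toy corpus of two short texts, already tokenized.
docs = [
    "new phone battery lasts long".split(),
    "battery drains fast on this phone".split(),
]

# Shared vocabulary over the corpus, in a fixed (sorted) order.
vocabulary = sorted({term for doc in docs for term in doc})

# Assumed stand-in for entries of a strong feature thesaurus.
strong_terms = {"battery", "phone"}

vectors = [weighted_bow(doc, vocabulary, strong_terms) for doc in docs]
```

In this sketch the two texts share only the terms "battery" and "phone"; up-weighting them increases their share of each vector, which is the intended effect of strengthening the discriminative power of strong feature terms under sparse data.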
[1] Jiafeng Guo, et al. BTM: Topic Modeling over Short Texts. IEEE Transactions on Knowledge and Data Engineering, 2014.
[2] Bing Liu, et al. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Data-Centric Systems and Applications, 2006.
[3] Somnath Banerjee, et al. Clustering short texts using wikipedia. SIGIR, 2007.
[4] Nir Friedman, et al. Probabilistic Graphical Models - Principles and Techniques. 2009.
[5] Susumu Horiguchi, et al. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. WWW, 2008.
[6] อนิรุธ สืบสิงห์, et al. Data Mining Practical Machine Learning Tools and Techniques. 2014.
[7] Nan Sun, et al. Exploiting internal and external semantics for the clustering of short texts using world knowledge. CIKM, 2009.