Short text classification based on strong feature thesaurus

Data sparseness, the defining characteristic of short text, has long been regarded as the main cause of low accuracy when short texts are classified with statistical methods. This area has been researched intensively over the past decade. However, most researchers have overlooked that ignoring the semantic importance of certain feature terms may also contribute to low classification accuracy. In this paper we present a new method that tackles the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to the feature terms in the SFT, classification accuracy can be improved. In particular, our method appeared to be more effective for more fine-grained classification. Experiments on two short text datasets demonstrate that our approach improves on state-of-the-art methods, including support vector machine (SVM) and Naïve Bayes Multinomial.
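The weighting idea can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only an information gain ranking and omits the LDA step entirely, and the names `build_sft`, `weighted_counts`, `top_k`, and `boost`, the boost factor of 2.0, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the class."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def build_sft(docs, labels, top_k):
    """Hypothetical SFT: the top_k terms ranked by information gain only."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return set(ranked[:top_k])


def weighted_counts(doc, sft, boost=2.0):
    """Term counts, with terms found in the SFT multiplied by an assumed boost."""
    return {t: c * (boost if t in sft else 1.0)
            for t, c in Counter(doc).items()}


# Toy corpus (invented for this sketch): tokenized short texts and their classes.
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "news"],
    ["stock", "market", "news"],
    ["stock", "price", "deal"],
]
labels = ["travel", "travel", "finance", "finance"]

sft = build_sft(docs, labels, top_k=2)
features = weighted_counts(["flight", "deal", "flight"], sft)
```

In this toy corpus, "flight" and "stock" each separate the two classes perfectly, so they are selected as strong features, while "deal" and "news" occur equally in both classes and keep their plain counts. The boosted counts could then be passed to any standard classifier as its feature vector.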
