Crest: Cluster-based Representation Enrichment for Short Text Classification

Text classification has attracted research interest for decades. Many techniques have been developed and have demonstrated high classification accuracy in various applications. Recently, the popularity of social platforms has changed the way we access (and contribute) information. In particular, short messages, comments, and status updates now make up a large portion of online text data. The shortness and, more importantly, the sparsity of short text call for revisiting text classification techniques developed for well-written documents such as news articles. In this paper, we propose a cluster-based representation enrichment method, namely Crest, to deal with the shortness and sparsity of short text. More specifically, we propose to enrich a short text representation by incorporating a vector of topical relevances in addition to the commonly adopted tf-idf representation. The topics are derived from the knowledge embedded in the short text collection of interest, using a hierarchical clustering algorithm with purity control. Our experiments show that the enriched representation significantly improves the accuracy of short text classification. The experiments were conducted on a benchmark dataset consisting of Web snippets, using Support Vector Machines (SVM) as the classifier.
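
The enrichment idea can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names, the centroid-linkage merging rule, the reading of "purity control" as "refuse any merge that would drop a cluster's majority-label share below `min_purity`", the `min_sim` cut-off, and the toy data are all assumptions made for illustration. Each short text is represented by its tf-idf vector followed by its cosine similarity to every topic cluster centroid.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn a vocabulary and smoothed idf weights from tokenised texts."""
    df = Counter(w for d in docs for w in set(d))
    vocab = sorted(df)
    idf = {w: math.log(len(docs) / df[w]) + 1.0 for w in vocab}
    return vocab, idf


def tfidf(doc, vocab, idf):
    """Raw tf-idf vector; words outside the vocabulary are ignored."""
    tf = Counter(doc)
    return [tf[w] * idf[w] for w in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vecs):
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def purity(labels):
    """Share of the majority label inside one cluster."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def cluster_with_purity(vecs, labels, min_purity=1.0, min_sim=0.1):
    """Agglomerative clustering (assumed variant): repeatedly merge the most
    similar pair of clusters, skipping merges that would violate min_purity."""
    clusters = [[i] for i in range(len(vecs))]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = clusters[a] + clusters[b]
                if purity([labels[i] for i in merged]) < min_purity:
                    continue
                sim = cosine(centroid([vecs[i] for i in clusters[a]]),
                             centroid([vecs[i] for i in clusters[b]]))
                if sim >= min_sim and (best is None or sim > best[0]):
                    best = (sim, a, b)
        if best is None:
            # Singletons carry no topical signal, so they are dropped.
            return [c for c in clusters if len(c) > 1]
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]


def enrich(vec, centroids):
    """tf-idf vector followed by one topical relevance score per cluster."""
    return vec + [cosine(vec, c) for c in centroids]


# Toy labelled collection (hypothetical data).
train = [("apple iphone release".split(), "tech"),
         ("iphone android phone".split(), "tech"),
         ("football match score".split(), "sport"),
         ("football league match".split(), "sport")]
docs = [d for d, _ in train]
labels = [y for _, y in train]

vocab, idf = fit_idf(docs)
vecs = [tfidf(d, vocab, idf) for d in docs]
clusters = cluster_with_purity(vecs, labels)
centroids = [centroid([vecs[i] for i in c]) for c in clusters]

enriched = enrich(tfidf("iphone phone".split(), vocab, idf), centroids)
```

In this toy run the new text `iphone phone` receives a positive relevance score for the cluster built from the two "tech" snippets and a score of zero for the "sport" cluster. The enriched vectors would then be passed to a classifier such as an SVM, which is omitted here.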
