Exploiting internal and external semantics for the clustering of short texts using world knowledge

Clustering short texts, such as snippets, is a major challenge for existing aggregated search techniques because of data sparseness and the complex semantics of natural language. Because short texts do not provide sufficient term occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when applied directly to short-text tasks. In this paper, we propose a novel framework that improves short-text clustering by exploiting internal semantics from the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity of the original short texts and reconstructs the corresponding feature space by integrating multiple semantic knowledge bases, namely Wikipedia and WordNet. Empirical evaluation on Reuters and a real web dataset demonstrates that our approach achieves significant improvement over state-of-the-art methods.
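
The sparsity problem described above can be illustrated with a minimal sketch. It is not the framework proposed in the paper: the `CONCEPTS` dictionary is a hypothetical toy stand-in for an external knowledge base such as Wikipedia or WordNet, and the helper names (`bag_of_words`, `enrich`, `cosine`) are assumptions made for this example only. The sketch shows two related snippets that share no terms, so their bag-of-words similarity is zero, and how adding external concept features gives them a nonzero similarity.

```python
import math
from collections import Counter

# Hypothetical toy concept map; a real system would query a knowledge
# base such as Wikipedia or WordNet instead of a hand-written dict.
CONCEPTS = {
    "jaguar": "concept:big_cat",
    "puma": "concept:big_cat",
}


def bag_of_words(text):
    """Represent a text as a term-frequency vector (a Counter)."""
    return Counter(text.lower().split())


def enrich(text):
    """Bag of words plus one feature per matched external concept."""
    features = bag_of_words(text)
    for term in list(features):
        concept = CONCEPTS.get(term)
        if concept is not None:
            features[concept] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


snippet_1 = "jaguar hunting prey"
snippet_2 = "puma stalks deer"

# No shared terms, so the plain bag-of-words similarity is zero.
plain_similarity = cosine(bag_of_words(snippet_1), bag_of_words(snippet_2))

# Both snippets map to the same external concept, so they now overlap.
enriched_similarity = cosine(enrich(snippet_1), enrich(snippet_2))

print(plain_similarity, enriched_similarity)
```

In this toy setting the shared concept feature is the only source of overlap; how such external features are selected and weighted against the original terms is exactly what a full framework has to decide.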
