Wikipedia-Based Smoothing for Enhancing Text Clustering

Conventional text clustering algorithms based on the bag-of-words model fail to fully capture the semantic relations between words. As a result, documents describing the same topic may not be assigned to the same cluster if they use different sets of words. A generic solution to this problem is to use background knowledge to enrich the document contents. In this research, we adopt a language modeling approach to text clustering and propose smoothing the document language models with Wikipedia articles in order to improve clustering performance. The contents of Wikipedia articles, as well as their assigned categories, are used in three different ways to smooth the document language models, with the goal of enriching the document contents. Clustering is then performed on a document similarity graph constructed from the enriched document collection. Experimental results confirm the effectiveness of the proposed methods.
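
The general idea of smoothing a document language model with background text can be illustrated with a minimal sketch. It is not the method of this work: the abstract does not specify the three smoothing schemes, so the linear interpolation, the weight `lam`, the cosine similarity, and the toy token lists below are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def unigram_lm(tokens):
    """Maximum-likelihood unigram language model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smooth(doc_lm, background_lm, lam=0.5):
    """Linearly interpolate a document model with a background model.

    lam is a hypothetical mixing weight; the interpolation form is an
    assumption, not the scheme described in the abstract.
    """
    vocab = set(doc_lm) | set(background_lm)
    return {
        w: (1 - lam) * doc_lm.get(w, 0.0) + lam * background_lm.get(w, 0.0)
        for w in vocab
    }


def cosine(p, q):
    """Cosine similarity between two sparse distributions (assumed measure)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Two toy documents on the same topic that share no words.
doc_a = "car engine repair".split()
doc_b = "automobile motor maintenance".split()
# Stand-in for the text of a related Wikipedia article (made-up tokens).
wiki = "car automobile engine motor vehicle".split()

lm_a, lm_b, lm_wiki = unigram_lm(doc_a), unigram_lm(doc_b), unigram_lm(wiki)
sm_a, sm_b = smooth(lm_a, lm_wiki), smooth(lm_b, lm_wiki)

sim_before = cosine(lm_a, lm_b)  # no shared words, so no overlap
sim_after = cosine(sm_a, sm_b)   # shared background terms create overlap
print(sim_before, sim_after)
```

In this toy setting the two unsmoothed models have no vocabulary in common, while the smoothed models overlap on the background terms. Pairwise similarities of this kind could serve as edge weights in a document similarity graph.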
