Utilizing Term Proximity based Features to Improve Text Document Clustering

Measuring inter-document similarity is an essential step in text document clustering. Traditional methods represent text documents with the simple Bag-of-Words (BOW) model, which assumes that the terms of a document are independent of each other. Such single-term analysis ignores the underlying (semantic) structure of a document. Considerable effort has been made in the literature to enrich the BOW representation with phrases and n-grams such as bi-grams and tri-grams. These approaches capture dependencies only between adjacent terms or within a continuous sequence of terms. However, while some dependencies exist between adjacent words, others span longer distances. In this paper, we enrich the traditional document vector by adding the notion of term-pair features. A term-pair feature is a pair of terms from the same document that may be either adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology for selecting potential term-pairs from a given document. Utilizing term proximity between distant terms also gives two documents some flexibility to be judged similar when they cover similar topics but differ in writing style. Experimental results on a standard web document data set show that the clustering performance is substantially improved by adding
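
As a rough illustration of the term-pair idea described above, the following Python sketch augments a BOW count vector with unordered pairs of terms that co-occur within a fixed window. It is not the selection methodology proposed in the paper, which is not detailed here: the window-based rule, the `window` parameter, raw counts as weights, and the function names `term_pair_features`, `document_vector`, and `cosine` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def term_pair_features(tokens, window=3):
    """Count unordered term pairs that co-occur within `window` positions.

    Assumption: a simple sliding window stands in for the paper's
    term-pair selection step, which is not specified in this text.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` positions, so pairs may be
        # adjacent (distance 1) or distant (up to `window`).
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def document_vector(text, window=3):
    """Build a BOW count vector and add term-pair features to it."""
    tokens = text.lower().split()
    vector = Counter(tokens)  # single-term (BOW) features
    vector.update(term_pair_features(tokens, window))  # term-pair features
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because pairs are stored in sorted order, two documents that mention the same terms in a different order (for example "clustering of text documents" and "text documents clustering") still share pair features such as `("clustering", "text")`, which is one way to read the flexibility toward varied writing styles mentioned above.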
