PhraseRank for document clustering: reweighting the weight of phrase

Given a document collection, a hierarchical clustering algorithm groups the documents into clusters. Recent work has identified the set of overlap phrases as a useful feature in hierarchical document clustering. However, that work considered neither the relationship between overlap phrases that co-occur in a document nor the degrees of the opposite relationships between overlap phrases. In this paper, we propose new algorithms that compute an effective similarity measure before the hierarchical clustering algorithm is run. The proposed methods have two important features: a ranked list of the top-k phrases for each overlap phrase, and the opposite significances of two overlap phrases with respect to each other. Experimental results show that the proposed method improves clustering results.
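
The two features named above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define the ranking or the significance measure, so the sketch assumes that each document is already reduced to a set of overlap phrases, that the top-k list for a phrase is ordered by raw co-occurrence count, and that the "opposite significances" are the two directional conditional frequencies of a phrase pair. The names `top_k_cooccurring` and `directional_significance` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def top_k_cooccurring(docs, k):
    """Return, for each phrase, its k most frequent co-occurring phrases.

    docs is a list of sets of phrases. Ranking by raw co-occurrence count
    is an assumption; the paper may weight co-occurrence differently.
    """
    co = defaultdict(Counter)
    for phrases in docs:
        # Count every ordered pair of distinct phrases in the same document.
        for a, b in permutations(sorted(set(phrases)), 2):
            co[a][b] += 1
    # Sort by descending count, breaking ties alphabetically.
    return {
        p: sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for p, c in co.items()
    }


def directional_significance(docs, a, b):
    """Fraction of documents containing phrase a that also contain phrase b.

    This is asymmetric: the value for (a, b) generally differs from the
    value for (b, a). Using it as the "opposite significance" of a pair
    is an assumption made for illustration only.
    """
    with_a = [set(d) for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if b in d) / len(with_a)


# Toy collection (hypothetical data, not taken from the paper).
sample_docs = [
    {"data mining", "text clustering", "suffix tree"},
    {"data mining", "text clustering"},
    {"data mining", "suffix tree"},
    {"text clustering"},
]
sample_ranks = top_k_cooccurring(sample_docs, k=1)
```

In this toy collection every document that contains "suffix tree" also contains "data mining", but only two of the three documents that contain "data mining" contain "suffix tree", so the two directions of the pair receive different values. A similarity measure could use that asymmetry to reweight phrases before clustering.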
