Query Directed Web Page Clustering

Web page clustering methods categorize and organize search results into semantically meaningful clusters that assist users with search refinement, but finding clusters that are semantically meaningful to users is difficult. In this paper, we describe a new Web page clustering algorithm, QDC, which uses the user's query as part of a reliable measure of cluster quality. The new algorithm has five key innovations: a new query directed cluster quality guide that uses the relationship between clusters and the query; an improved cluster merging method that generates semantically coherent clusters by using cluster description similarity in addition to cluster overlap; a new cluster splitting method that fixes the cluster chaining (cluster drifting) problem; an improved heuristic for cluster selection that uses the query directed cluster quality guide; and a new method of improving clusters by ranking the pages by relevance to the cluster. We evaluate QDC by comparing its clustering performance against that of four other algorithms on eight data sets (four using full text data and four using snippet data) with eleven different external evaluation measurements. We also evaluate QDC by informally analysing its real world usability and performance through comparison with six other algorithms on four data sets. QDC provides a substantial performance improvement over other Web page clustering algorithms.
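To make the merging idea concrete, the following is a minimal sketch of a merge test that requires both page overlap and description similarity. It is an illustration under stated assumptions, not QDC's actual procedure: the `Cluster` type, the Jaccard and overlap measures, and the thresholds `min_overlap` and `min_desc_sim` are all hypothetical choices that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    description: frozenset  # words describing the cluster (hypothetical representation)
    pages: frozenset        # ids of the pages assigned to the cluster


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Shared elements as a fraction of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def should_merge(c1, c2, min_overlap=0.6, min_desc_sim=0.3):
    """Merge only if the clusters share pages AND have similar descriptions.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return (overlap(c1.pages, c2.pages) >= min_overlap
            and jaccard(c1.description, c2.description) >= min_desc_sim)


def merge(c1, c2):
    """Union of the descriptions and pages of two clusters."""
    return Cluster(c1.description | c2.description, c1.pages | c2.pages)


# Toy clusters for the query "jaguar" (made-up data).
a = Cluster(frozenset({"jaguar", "car"}), frozenset({1, 2, 3, 4}))
b = Cluster(frozenset({"jaguar", "car", "dealer"}), frozenset({2, 3, 4, 5}))
c = Cluster(frozenset({"animal", "cat"}), frozenset({1, 2, 3, 6}))

print(should_merge(a, b))  # high page overlap and similar descriptions
print(should_merge(a, c))  # same page overlap, unrelated descriptions
```

In this toy data, `a` overlaps `b` and `c` equally on pages, so an overlap-only test would merge both pairs; adding the description check is what keeps the semantically unrelated pair apart.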
