Evaluating contents-link coupled web page clustering for web search results

Clustering is currently one of the most crucial techniques for dealing with the massive amount of heterogeneous information on the web, for example in locating resources and interpreting information. Unlike clustering in other fields, web page clustering separates unrelated pages and groups pages related to a specific topic into semantically meaningful clusters, which is useful for the discrimination, summarization, organization and navigation of unstructured web pages. We have proposed a contents-link coupled clustering algorithm that clusters web pages by combining content and link analysis. In this paper, we study the effects of out-links (from a web page), in-links (to a web page) and terms on the final clustering results, as well as how to combine these three parts effectively to improve the quality of the clustering. We apply the algorithm to cluster web search results and conduct preliminary experiments and evaluations on various topics. The experimental results show that the proposed clustering algorithm is effective and promising.
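To make the idea of coupling the three parts concrete, the following is a minimal sketch, not the algorithm evaluated in the paper. It assumes each page is described by three sets (out-links, in-links, terms), that each part is compared with cosine similarity over binary vectors, and that the three scores are merged by a weighted sum. The function names, the dictionary keys and the equal default weights are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def combined_similarity(p, q, w_out=1 / 3, w_in=1 / 3, w_term=1 / 3):
    """Weighted sum of out-link, in-link and term similarity.

    The weights are illustrative assumptions, not values from the paper.
    """
    return (
        w_out * cosine(p["out"], q["out"])
        + w_in * cosine(p["in"], q["in"])
        + w_term * cosine(p["terms"], q["terms"])
    )


# Two hypothetical search results for the same query.
page_a = {"out": {"u1", "u2"}, "in": {"v1"}, "terms": {"jaguar", "car"}}
page_b = {"out": {"u2", "u3"}, "in": {"v1"}, "terms": {"jaguar", "cat"}}

score = combined_similarity(page_a, page_b)
```

Changing the three weights shifts how much each part contributes, which is one simple way to explore the relative effect of out-links, in-links and terms on the resulting clusters.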
