Exploring similarity among Web pages using the hyperlink structure

Hyperlinks in HTML pages carry a wealth of information about the relationships among Web pages. Given a set of Web pages, we can explore the hyperlink relationships among them. This paper first provides formal definitions of hyperlink relations. We then use this notation to define similarity between two Web pages and between two sets of Web pages. For each case, we provide several definitions of similarity based on forward and backward links. Each similarity measure yields a number between 0 and 1. We also demonstrate how to use the similarity measures to study clustering within a set of pages and to determine the "diversity" of a set of Web pages.
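
The abstract does not state the paper's formal definitions, so the following is only a minimal sketch of what a link-based similarity in [0, 1] could look like. It assumes the Jaccard coefficient over forward-link sets (pages a page points to) and backward-link sets (pages pointing to it), and it assumes "diversity" is one minus the mean pairwise similarity. The function names, the toy graph, and both of these choices are illustrative assumptions, not the authors' definitions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets, in [0, 1]; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def backward_links(forward):
    """Invert a forward-link map {page: set of targets} into {page: set of sources}."""
    back = {page: set() for page in forward}
    for src, targets in forward.items():
        for dst in targets:
            back.setdefault(dst, set()).add(src)
    return back


def page_similarity(p, q, forward, backward):
    """Forward- and backward-link similarity of two pages (assumed: Jaccard)."""
    return {
        "forward": jaccard(forward.get(p, set()), forward.get(q, set())),
        "backward": jaccard(backward.get(p, set()), backward.get(q, set())),
    }


def diversity(pages, forward, backward, kind="forward"):
    """Assumed diversity of a page set: 1 minus the mean pairwise similarity."""
    pairs = list(combinations(pages, 2))
    if not pairs:
        return 0.0
    total = sum(page_similarity(p, q, forward, backward)[kind] for p, q in pairs)
    return 1.0 - total / len(pairs)


# Hypothetical toy graph: each key links to the pages in its set.
forward = {
    "a": {"x", "y"},
    "b": {"y", "z"},
    "c": {"a", "b"},
    "d": {"a"},
}
backward = backward_links(forward)

sim_ab = page_similarity("a", "b", forward, backward)
div_ab = diversity(["a", "b"], forward, backward)
```

In this toy graph, pages "a" and "b" share one of three distinct forward targets and one of two distinct backward sources, so the two scores differ. This shows why separate forward and backward definitions can be useful.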
