Mining Web site's clusters from link topology and site hierarchy

Foraging for information in large and complex Web sites using keyword search alone usually results in an unpleasant experience because the search results are overloaded. To support more effective information search, descriptive abstractions of the Web sites (e.g., sitemaps) are often needed. However, their creation and maintenance normally require recurrent manual effort because Web contents change quickly. We extend the HITS algorithm and integrate hyperlink topology with Web site hierarchy to identify a hierarchy of Web page clusters as the abstraction of a Web site. As the algorithm is based on HITS, each identified cluster follows the bipartite graph structure, with an authority and hub pair as the cluster summary. The effectiveness of the algorithm has been evaluated using three different Web sites (containing approximately 6000-14000 Web pages) with promising results. A detailed interpretation of the experimental results and a qualitative comparison with related work are also included.
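
For readers unfamiliar with HITS, the following is a minimal sketch of the standard hub/authority iteration that the approach builds on. It is not the extended algorithm described above: it ignores the site hierarchy and does not extract clusters. The function name `hits`, the iteration count, and the toy link graph are assumptions made purely for illustration.

```python
import math


def hits(links, iterations=50):
    """Standard HITS iteration on a dict {page: [pages it links to]}.

    Illustrative sketch only; not the extended algorithm from the abstract.
    """
    # Collect every page that appears as a link source or a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}

        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical toy site: three index-like pages pointing at content pages.
toy_site = {
    "index": ["products", "support"],
    "sitemap": ["products", "support", "about"],
    "news": ["products"],
    "products": [],
    "support": [],
    "about": [],
}

hub, auth = hits(toy_site)
top_hub = max(hub, key=hub.get)     # page that links to the strongest authorities
top_auth = max(auth, key=auth.get)  # page that the strongest hubs link to
print(top_hub, top_auth)
```

In this toy graph the hub pages and the authority pages form a bipartite structure, and the top-scoring hub and authority together act as a compact summary of it, which is the role the hub and authority pair plays for each cluster in the abstract.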
