PageCluster: Mining conceptual link hierarchies from Web log files for adaptive Web site navigation

User traversals of hyperlinks between Web pages can reveal semantic relationships between those pages. We therefore use these traversals as weights to measure the semantic relationships between Web pages. On the basis of these weights, we propose a novel method for placing the pages of a Web site on different conceptual levels in a link hierarchy. We develop a clustering algorithm called PageCluster, which clusters conceptually related pages on each conceptual level of the link hierarchy based on their in-link and out-link similarities. The clusters are then used to construct a conceptual link hierarchy, which is visualized in a prototype called Online Navigation Explorer (ONE) for adaptive Web site navigation. Our experiments show that our method places Web pages on the conceptual levels of a link hierarchy more accurately than both the breadth-first search method and the shortest-weighted-path method, and that PageCluster clusters conceptually related pages more accurately than the bibliographic analysis method. Our user study also shows that the conceptual link hierarchy visualized in ONE helps users find information more effectively and efficiently as the task becomes less specific and involves more Web pages on multiple conceptual levels.
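To make the idea of traversal-weighted in-link and out-link similarity concrete, here is a minimal Python sketch. It is not the PageCluster algorithm itself, whose details the abstract does not give. It assumes that a link's weight is the number of times users traversed it, that similarity is the cosine between weighted link vectors, and that in-link and out-link similarity are averaged. The function names and the toy sessions are hypothetical.

```python
from collections import Counter
from math import sqrt


def traversal_weights(sessions):
    """Count how often each hyperlink (source, target) is traversed.

    Assumption: each session is an ordered list of page names taken
    from a Web log, and consecutive pages are joined by a hyperlink.
    """
    weights = Counter()
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            weights[(src, dst)] += 1
    return weights


def link_vector(weights, page, direction):
    """Return the weighted out-link or in-link vector of a page as a dict."""
    if direction == "out":
        return {dst: w for (src, dst), w in weights.items() if src == page}
    return {src: w for (src, dst), w in weights.items() if dst == page}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def link_similarity(weights, a, b):
    """Average of in-link and out-link cosine similarity (an assumed combination)."""
    sim_in = cosine(link_vector(weights, a, "in"), link_vector(weights, b, "in"))
    sim_out = cosine(link_vector(weights, a, "out"), link_vector(weights, b, "out"))
    return (sim_in + sim_out) / 2


# Hypothetical sessions for illustration only.
sessions = [
    ["home", "products", "laptop"],
    ["home", "support", "laptop"],
    ["home", "products", "phone"],
]
weights = traversal_weights(sessions)
print(link_similarity(weights, "products", "support"))
```

In this toy data, "products" and "support" are both reached from "home" and both lead to "laptop", so they score as more similar than "laptop" and "phone", which share only one incoming page and have no out-links. A clustering step could then group pages whose similarity exceeds a chosen threshold.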
