A Distributed Clustering Algorithm for Web-Based Access Patterns

ABSTRACT We introduce a distributed document clustering algorithm based on user access patterns for multi-server Web sites. Our algorithm makes it possible to exploit adaptive document replication and persistent connections simultaneously, two techniques that are most effective in decreasing the response time observed by Web users. The algorithm first distributes the user access data evenly among the servers by using a hash function. Then, each server generates a local clustering on its share of the user session records by employing a traditional single-machine document clustering algorithm. Finally, the local clustering results are combined by a novel procedure that generates maximal large itemsets of Web documents. We present preliminary experimental results and discuss alternative approaches to be pursued in the future.
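
The first and last steps of this pipeline can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the hash function, the local clustering algorithm, or the details of the merging procedure. The sketch assumes an MD5-based hash on a hypothetical session identifier, treats the local clusters as already computed, and uses a plain Apriori-style level-wise search to find maximal large itemsets. The names `assign_server`, `partition_sessions`, `maximal_large_itemsets`, and `min_support` are all illustrative.

```python
import hashlib
from itertools import combinations


def assign_server(session_id: str, num_servers: int) -> int:
    """Map a session to a server index with a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def partition_sessions(sessions: dict, num_servers: int) -> list:
    """Split {session_id: documents} into one list of document sets per server."""
    shards = [[] for _ in range(num_servers)]
    for session_id, docs in sessions.items():
        shards[assign_server(session_id, num_servers)].append(set(docs))
    return shards


def maximal_large_itemsets(clusters: list, min_support: int) -> list:
    """Return the maximal document sets contained in at least min_support clusters.

    `clusters` stands in for the local clusters gathered from all servers.
    The search is Apriori-style and level-wise; this is an assumed stand-in
    for the merging procedure, which the abstract does not describe.
    """
    def support(itemset):
        return sum(1 for cluster in clusters if itemset <= cluster)

    items = sorted({doc for cluster in clusters for doc in cluster})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    large = []
    while level:
        large.extend(level)
        # Join pairs of large k-itemsets that differ in exactly one document.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
    # Keep only itemsets that are not a proper subset of another large itemset.
    return [s for s in large if not any(s < t for t in large)]
```

A stable hash is used instead of Python's built-in `hash` so that every server maps a given session to the same shard across processes. The local clustering step is deliberately left out, since any single-machine algorithm could fill that role.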
