Distributed Web Log Mining Using Maximal Large Itemsets

Abstract. We introduce a partitioning-based distributed document-clustering algorithm using user access patterns from multi-server web sites. Our algorithm makes it possible to exploit simultaneously adaptive document replication and persistent connections, two techniques that are most effective in decreasing the response time that is observed by web users. The algorithm first distributes the user access data evenly among the servers by using a hash function. Then, each server generates a local clustering on its fair share of the user sessions records by employing a traditional single-machine document-clustering algorithm. Finally, those local clustering results are combined together by using a novel procedure that generates maximal large itemsets of web documents. We present preliminary experimental results and discuss alternative approaches to be pursued in the future.

[1]  Jaideep Srivastava,et al.  Data Preparation for Mining World Wide Web Browsing Patterns , 1999, Knowledge and Information Systems.

[2]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[3]  Jiawei Han,et al.  Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[4]  Peter Scheuermann,et al.  Web++: A System for Fast and Reliable Web Service , 1999, USENIX Annual Technical Conference, General Track.

[5]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[6]  Umeshwar Dayal,et al.  An Application of Adaptive Data Mining: Facilitating Web Information Access , 1997, DMKD.

[7]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[8]  Jaideep Srivastava,et al.  Creating adaptive Web sites through usage-based clustering of URLs , 1999, Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX'99) (Cat. No.PR00453).

[9]  L. R. Rasmussen,et al.  In information retrieval: data structures and algorithms , 1992 .

[10]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[11]  Clement T. Yu,et al.  Adaptive record clustering , 1985, TODS.

[12]  Roy T. Fielding,et al.  Hypertext Transfer Protocol - HTTP/1.0 , 1996, RFC.

[13]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[14]  Amit Aggarwal,et al.  RaDaR: A Scalable Architecture for a Global Web Hosting Service , 1999, Comput. Networks.

[15]  Hillol Kargupta,et al.  Distributed Clustering Using Collective Principal Component Analysis , 2001, Knowledge and Information Systems.

[16]  Michelle Butler,et al.  A Scalable HTTP Server: The NCSA Prototype , 1994, Comput. Networks ISDN Syst..

[17]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.