Mining the Structure of User Activity using Cluster Stability

Recent research has explored web user session clustering as a means of understanding user activity and interests on the World Wide Web. Though the proposed techniques have proven to be useful and effective, they require that one either specify the number of clusters in advance or browse a large hierarchy of clusters to find the optimal depth at which to describe user activity. In this paper, we examine the utility of a stability-based technique for automatically determining the optimal number of clusters in the context of web user session clustering. We present two case studies evaluating the technique’s effectiveness.
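As a rough illustration of what a stability-based choice of the number of clusters can look like, the sketch below repeatedly clusters two random subsamples of the data and measures how well the two clusterings agree on the points they share. It is a generic sketch under stated assumptions, not the procedure evaluated in the paper: the use of k-means with farthest-first seeding, the Jaccard score over co-clustered pairs, the 0.9 threshold, the synthetic 2-D points, and all function names (`kmeans`, `jaccard`, `stability`, `pick_k`, `demo_points`) are illustrative choices.

```python
import random
from itertools import combinations


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Seed with the first point, then repeatedly add the point that is
    # farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def jaccard(labels_a, labels_b):
    """Jaccard similarity of the co-clustered pairs of two labelings."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability(points, k, trials=20, fraction=0.8, seed=0):
    """Mean agreement between clusterings of pairs of random subsamples."""
    rng = random.Random(seed)
    n = len(points)
    m = int(fraction * n)
    scores = []
    for _ in range(trials):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k)))
        lab_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k)))
        # Compare the two clusterings only on points present in both subsamples.
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(jaccard([lab_a[i] for i in shared],
                              [lab_b[i] for i in shared]))
    return sum(scores) / len(scores)


def pick_k(scores, threshold=0.9):
    """Largest k whose mean stability reaches the (assumed) threshold."""
    stable = [k for k, s in scores.items() if s >= threshold]
    return max(stable) if stable else None


def demo_points():
    """Three tight, well-separated synthetic blobs of 20 points each."""
    rng = random.Random(1)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
            for cx, cy in centers for _ in range(20)]
```

A possible use is `pick_k({k: stability(points, k) for k in range(2, 7)})`, where `points` would be whatever feature vectors represent the user sessions. Taking the largest k that stays above a threshold is only one plausible selection rule; inspecting the full distribution of agreement scores for each k is another.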
