Quantitative Evaluation of Performance and Validity Indices for Clustering the Web Navigational Sessions

Clustering techniques are widely used in Web Usage Mining to capture similar interests and trends among users accessing a Web site. For this purpose, web access logs generated at a particular web site are preprocessed to discover the user navigational sessions. Clustering techniques are then applied to group the user session data into user session clusters, where intercluster similarities are minimized while the intra cluster similarities are maximized. Since the application of different clustering algorithms generally results in different sets of cluster formation, it is important to evaluate the performance of these methods in terms of accuracy and validity of the clusters, and also the time required to generate them, using appropriate performance measures. This paper describes various validity and accuracy measures including Dunn's Index, Davies Bouldin Index, C Index, Rand Index, Jaccard Index, Silhouette Index, Fowlkes Mallows and Sum of the Squared Error (SSE). We conducted the performance evaluation of the following clustering techniques: k-Means, k-Medoids, Leader, Single Link Agglomerative Hierarchical and DBSCAN. These techniques are implemented and tested against the Web user navigational data. Finally their performance results are presented and compared.

[1]  Benjamin King Step-Wise Clustering Procedures , 1967 .

[2]  Peter H. A. Sneath,et al.  Numerical Taxonomy: The Principles and Practice of Numerical Classification , 1973 .

[3]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[4]  Bamshad Mobasher,et al.  Data Mining for Web Personalization , 2007, The Adaptive Web.

[5]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[6]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[7]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[8]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[9]  Ting Su,et al.  Automated hierarchical mixtures of probabilistic principal component analyzers , 2004, ICML.

[10]  Fionn Murtagh,et al.  A Survey of Recent Advances in Hierarchical Clustering Algorithms , 1983, Comput. J..

[11]  J. Dunn Well-Separated Clusters and Optimal Fuzzy Partitions , 1974 .

[12]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[13]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[14]  Robert F. Ling,et al.  Cluster analysis algorithms for data reduction and classification of objects , 1981 .

[15]  Tao Luo,et al.  Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization , 2004, Data Mining and Knowledge Discovery.

[16]  Jaideep Srivastava,et al.  Automatic personalization based on Web usage mining , 2000, CACM.

[17]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[19]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[20]  T. Ravindra Babu,et al.  Comparison of genetic algorithm based prototype selection schemes , 2001, Pattern Recognit..

[21]  E. Mizutani,et al.  Neuro-Fuzzy and Soft Computing-A Computational Approach to Learning and Machine Intelligence [Book Review] , 1997, IEEE Transactions on Automatic Control.

[22]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.