Improved Fuzzy C-Means Clustering of Web Usage Data with Genetic Algorithm

Clustering is an important function in web usage mining, which applies data mining techniques to discover usage patterns from web data. Cluster analysis aims to identify groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. Clustering methods are not only major tools for uncovering the underlying structure of a given data set, but also promising tools for uncovering local input-output relations of a complex system. Fuzzy c-means (FCM) is one of the most widely used fuzzy clustering algorithms in real-world applications. However, the method has two major limitations. The first is that the number of clusters must be specified in advance. The second is that FCM can get stuck in sub-optimal solutions. In this paper, we propose a new framework that uses a genetic algorithm (GA) to improve the quality of the web session clusters produced by fuzzy c-means clustering. In the first step, the fuzzy c-means algorithm is used to cluster the user sessions. In the second step, a GA-based refinement algorithm is applied to improve the cluster quality. The proposed algorithm is tested on web access logs collected from the Internet Traffic Archive (ITA), and the results show that refined initial starting points and post-processing refinement of the clusters indeed lead to improved solutions.
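
For readers unfamiliar with the first step, the following is a minimal sketch of the standard fuzzy c-means iteration (alternating center and membership updates) in plain Python. It is not the framework proposed in this paper and does not include the GA-based refinement. The function name `fuzzy_c_means`, its parameters, the random initialisation, and the toy `sessions` vectors are all assumptions made for illustration.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard FCM on a list of equal-length numeric tuples (illustrative).

    Returns (centers, u), where u[i][j] is the degree to which points[i]
    belongs to cluster j. The number of clusters c must be given in advance.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Center update: mean of all points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )

        # Membership update from the distances to the new centers.
        shift = 0.0
        for i in range(n):
            dist = [
                max(sum((points[i][d] - centers[j][d]) ** 2
                        for d in range(dim)) ** 0.5, 1e-12)
                for j in range(c)
            ]
            for j in range(c):
                new = 1.0 / sum(
                    (dist[j] / dist[k]) ** (2.0 / (m - 1.0)) for k in range(c)
                )
                shift = max(shift, abs(new - u[i][j]))
                u[i][j] = new

        # Stop once no membership changed by more than tol.
        if shift < tol:
            break
    return centers, u


# Toy usage: made-up 2-D feature vectors standing in for user sessions.
sessions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, memberships = fuzzy_c_means(sessions, c=2)
```

The sketch makes both limitations named above visible: `c` is a required argument, and the result depends on the random starting memberships controlled by `seed`, so different starts can settle in different local optima.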