A Fuzzy Set Theoretic approach to discover user sessions from web navigational data

As the WWW continues to grow in size and complexity, web site publishers face increasing difficulty in attracting and retaining users. To design attractive web sites, designers must understand their users' needs, so analysing the navigational behaviour of users is an important part of web page design. Web Usage Mining (WUM) is the application of data mining techniques to web usage data in order to discover patterns that can be used to analyse users' navigational behaviour. Preprocessing, knowledge extraction and results analysis are the three main steps of WUM. Because web logs contain a large amount of irrelevant information, the original log file cannot be used directly in the WUM process. During the preprocessing stage of WUM, raw web log data is transformed into a set of user profiles. Each user profile captures a set of URLs representing a user session. This sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining and sequence mining. If the data mining task at hand is clustering, the session files are filtered to remove very small sessions in order to eliminate noise from the data. However, direct removal of these small sessions may result in the loss of a significant amount of information, especially when the number of small sessions is large. We propose a “Fuzzy Set Theoretic” approach to deal with this problem. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a “Fuzzy Membership Function” based on the number of URLs accessed by each session. After assigning the weights, we apply a “Fuzzy c-Means Clustering” algorithm to discover the clusters of user profiles. In this paper, we provide a detailed review of various techniques to preprocess web log data, including data fusion, data cleaning, user identification and session identification. We also describe our methodology for performing the feature selection (or dimensionality reduction) and session weight assignment tasks. Finally, we compare our soft computing based approach of session weight assignment with the traditional hard computing based approach of small session elimination.
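The contrast between eliminating small sessions and weighting them can be illustrated with a minimal sketch. The abstract does not give the form of the membership function, so the linear ramp below, its `low` and `high` bounds, the function names and the sample sessions are all assumptions made for illustration, not the authors' definition.

```python
def session_weight(n_urls, low=1, high=6):
    """Hypothetical linear-ramp fuzzy membership (an assumption, not the
    paper's function): 0 at or below `low`, 1 at or above `high`,
    and linear in between."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if n_urls <= low:
        return 0.0
    if n_urls >= high:
        return 1.0
    return (n_urls - low) / (high - low)


def hard_filter(sessions, min_urls):
    """Hard approach: drop every session with fewer than `min_urls` URLs."""
    return [s for s in sessions if len(s) >= min_urls]


def soft_weights(sessions, low=1, high=6):
    """Soft approach: keep every session and give it a weight in [0, 1]
    that depends on the number of URLs it accessed."""
    return [session_weight(len(s), low, high) for s in sessions]


# Illustrative sessions; each one is the list of URLs it accessed.
sessions = [
    ["/home"],
    ["/home", "/products"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/a", "/b", "/c", "/d", "/e", "/f"],
]

kept = hard_filter(sessions, min_urls=3)           # two sessions survive
weights = soft_weights(sessions, low=1, high=6)    # one weight per session

for urls, w in zip(sessions, weights):
    print(len(urls), "URLs -> weight", round(w, 2))
```

In this sketch the hard filter discards the two shortest sessions outright, while the soft approach retains all four and lets the shorter ones carry less weight. One plausible way to use such weights, again an assumption, is to scale each session's contribution when a clustering algorithm updates its cluster centres.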
