Advanced data preprocessing for intersites Web usage mining

Web usage mining (WUM) applies data mining procedures to analyze user access to Web sites. As with any KDD (knowledge discovery and data mining) process, WUM contains three main steps: preprocessing, knowledge extraction, and results analysis. We focus on data preprocessing, a tedious and complex process. Analysts aim to determine the exact list of users who accessed the Web site and to reconstruct user sessions, that is, the sequence of actions each user performed on the Web site. Intersites WUM deals with Web server logs from several Web sites, generally belonging to the same organization. Analysts must therefore reassemble each user's path through all the different Web servers visited. Our solution is to join all the log files and reconstruct the visit. Classical data preprocessing involves three steps: data fusion, data cleaning, and data structuration. Our solution for WUM adds what we call advanced data preprocessing: a data summarization step that allows the analyst to select only the information of interest. We've successfully tested our solution in an experiment with log files from INRIA Web sites.
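
To make the idea of joining log files and reconstructing visits concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method evaluated in the paper: the record fields (`ip`, `agent`, `time`, `url`), the server names, the use of the (IP, user agent) pair as a user key, and the 30-minute inactivity timeout are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby


def fuse_logs(logs_by_server):
    """Data fusion: merge per-server request lists into one time-ordered list."""
    merged = [
        {"server": server, **request}
        for server, requests in logs_by_server.items()
        for request in requests
    ]
    merged.sort(key=lambda r: r["time"])
    return merged


def build_visits(requests, timeout=timedelta(minutes=30)):
    """Group requests by an assumed user key, then split on inactivity gaps.

    Assumption: a user is approximated by the (IP, user agent) pair, and a
    gap longer than `timeout` starts a new visit.
    """
    def user_key(r):
        return (r["ip"], r["agent"])

    ordered = sorted(requests, key=lambda r: (user_key(r), r["time"]))
    visits = []
    for _, group in groupby(ordered, key=user_key):
        current = []
        for request in group:
            if current and request["time"] - current[-1]["time"] > timeout:
                visits.append(current)
                current = []
            current.append(request)
        visits.append(current)
    return visits


# Hypothetical, already-parsed logs from two servers of one organization.
t0 = datetime(2004, 1, 1, 10, 0)
logs = {
    "site-a": [
        {"ip": "10.0.0.1", "agent": "A", "time": t0, "url": "/"},
        {"ip": "10.0.0.1", "agent": "A", "time": t0 + timedelta(hours=2), "url": "/news"},
    ],
    "site-b": [
        {"ip": "10.0.0.1", "agent": "A", "time": t0 + timedelta(minutes=5), "url": "/teams"},
    ],
}

visits = build_visits(fuse_logs(logs))
for visit in visits:
    print([(r["server"], r["url"]) for r in visit])
```

In this sketch the first visit spans both servers because the logs are merged before visits are built, which is the point of handling the intersites case jointly rather than per server.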
