Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology

Summary The exponential growth of the Web, in terms of both Web sites and their users, during the last decade has generated a huge amount of data related to users' interactions with Web sites. This data is recorded in the Web access log files of Web servers and is usually referred to as Web Usage Data (WUD). Knowledge Discovery from Web Usage Data (KDWUD) is the area of Web mining that deals with the application of data mining techniques to extract interesting knowledge from the WUD. As Web sites continue to grow in size and complexity, the results of KDWUD have become critical for the efficient and effective management of activities related to e-business, e-education, e-commerce, personalization, Web site design and management, network traffic analysis, caching, proxies, the great diversity of Web pages in a site, search engine complexity, and the prediction of users' actions. In this paper, we propose a complete methodology for preprocessing, one of the important steps in the KDWUD process. Several heuristics are proposed for cleaning the WUD, which is then aggregated and recorded in the relational data model. To validate the efficiency of the proposed preprocessing methodology, several experiments were conducted, and the results show that the proposed methodology reduces the size of Web access log files down to 73-82% of the initial size and offers richer logs that are structured for the further stages of KDWUD.
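To make the cleaning and storage steps concrete, the following is a minimal sketch in Python, not the paper's actual heuristics, which the summary does not spell out. It assumes the access log is in Common Log Format and applies three illustrative rules (drop requests for embedded resources such as images, drop non-successful responses, keep only GET requests) before writing the surviving entries to a relational table. The names `parse_line`, `keep`, `clean_log`, `store`, the `requests` table, and the list of ignored suffixes are all hypothetical.

```python
import re
import sqlite3

# Common Log Format pattern (an assumption; the summary does not name a log format).
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical list of resource types treated as noise, for illustration only.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not parse."""
    match = CLF.match(line)
    return match.groupdict() if match else None


def keep(entry):
    """Illustrative cleaning rules: page requests only, successful, GET."""
    path = entry["url"].split("?", 1)[0].lower()
    if path.endswith(IGNORED_SUFFIXES):
        return False
    if not entry["status"].startswith("2"):
        return False
    return entry["method"] == "GET"


def clean_log(lines):
    """Parse the lines and keep only the entries that pass the cleaning rules."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and keep(e)]


def store(entries, conn):
    """Record cleaned entries in a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(host TEXT, time TEXT, url TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?, ?, ?)",
        [(e["host"], e["time"], e["url"], int(e["status"])) for e in entries],
    )
    conn.commit()


sample = [
    '10.0.0.1 - - [10/Oct/2004:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2004:13:55:37 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2004:13:56:01 +0000] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

conn = sqlite3.connect(":memory:")
cleaned = clean_log(sample)
store(cleaned, conn)
```

In this sample, only the request for `/index.html` survives: the image request, the failed request, and the unparseable line are all discarded. A real pipeline would likely add further steps, such as robot filtering and user or session identification, which this sketch does not attempt.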
