A Hybrid Approach to Web Usage Mining

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web has become an important research area. Web usage mining, which is the main topic of this paper, focuses on knowledge discovery from the clicks in the web log for a given site (the so-called click-stream), especially on the analysis of sequences of clicks. Existing techniques for analyzing click sequences each have drawbacks: huge storage requirements, excessive I/O cost, or scalability problems when additional information is introduced into the analysis. In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of two existing approaches, the Hypertext Probabilistic Grammar (HPG) and the Click Fact Table. It allows additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. The development is driven by experience gained from industry collaboration. A prototype has been implemented, and the experiments presented show that the hybrid approach performs well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis; it can also be used for general sequence mining tasks.
