Indexing web access-logs for pattern queries

In this paper, we develop a new indexing method for large web access-logs. We are concerned with pattern queries, which search for access sequences that contain given query patterns. Queries of this kind find applications in processing the results of web-log mining (e.g., finding typical or atypical access sequences). The proposed method focuses on scalability with respect to web-log size. For this reason, we examine the gains due to signature-trees, which can further improve scalability to very large web-logs. Experimental results illustrate the superiority of the proposed method.
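
To make the notion of a pattern query concrete, the following is a minimal sketch of signature-based filtering, not the method proposed in the paper. It assumes that a sequence "contains" a pattern when the pattern's pages occur in the same order (not necessarily contiguously), that signatures are 64-bit superimposed codes built with a CRC32 hash, and that the index is a flat list rather than a signature-tree. All names (`signature`, `build_index`, `pattern_query`) are hypothetical.

```python
import zlib
from typing import Iterable, List, Sequence, Tuple

SIG_BITS = 64  # assumed signature width; a real index would tune this


def signature(pages: Iterable[str], bits: int = SIG_BITS) -> int:
    """Superimposed-coding signature: each page sets one bit of an integer."""
    sig = 0
    for page in pages:
        # CRC32 is used because it is deterministic across runs.
        sig |= 1 << (zlib.crc32(page.encode("utf-8")) % bits)
    return sig


def contains_pattern(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `page in it` advances the iterator, so order is enforced.
    return all(page in it for page in pattern)


def build_index(log: List[Sequence[str]]) -> List[Tuple[int, Sequence[str]]]:
    """Flat signature file: one (signature, sequence) pair per access sequence."""
    return [(signature(seq), seq) for seq in log]


def pattern_query(index: List[Tuple[int, Sequence[str]]],
                  pattern: Sequence[str]) -> List[Sequence[str]]:
    """Filter by signature first, then verify the surviving candidates."""
    qsig = signature(pattern)
    return [
        seq
        for sig, seq in index
        # Signature test: all query bits must be set (no false negatives).
        if sig & qsig == qsig
        # Verification step removes false positives and checks the order.
        and contains_pattern(seq, pattern)
    ]


log = [["A", "B", "C"], ["B", "A"], ["A", "D", "B"]]
index = build_index(log)
result = pattern_query(index, ["A", "B"])
```

In this sketch the signature test only discards sequences that cannot match; the verification step is still needed because different pages may hash to the same bit and because the signature ignores page order. A signature-tree would organize the same signatures hierarchically so that whole groups of sequences can be skipped at once.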
