Discovering critical edge sequences in E-commerce catalogs

Web sites allow the collection of vast amounts of navigational data: clickstreams of user traversals through the site. These massive data stores offer the tantalizing possibility of uncovering interesting patterns within the dataset. For e-businesses, always looking for an edge in the hyper-competitive online marketplace, this possibility is especially attractive. Of particular interest to e-businesses is the discovery of Critical Edge Sequences (CESs), which denote frequently traversed subpaths in the catalog. CESs can be used to improve site performance and site management, increase the effectiveness of advertising on the site, and gather additional knowledge of customer interest patterns on the site. Using traditional graph-based and web mining strategies to find CESs can be expensive in both space and time. In this paper, we propose a method to compute the most popular paths between node pairs in a catalog, which are then used to discover CESs. Our method is both space-efficient and accurate, providing a vast reduction in the storage requirement with minimal impact on accuracy. The algorithm, executed off-line in batch mode, is also practical with respect to running time: as a variant of single-source shortest-path, it runs in log-linear time.
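
To make the idea of a "most popular path" computed by a single-source shortest-path variant concrete, here is a minimal sketch. It is not the algorithm of the paper, whose details the abstract does not give. It assumes, purely for illustration, that the catalog is a directed graph whose edges carry traversal probabilities estimated from clickstreams, and that the popularity of a path is the product of its edge probabilities. Under that assumption, maximizing the product is equivalent to minimizing the sum of negative log probabilities, so Dijkstra's algorithm with a binary heap applies and runs in O((V + E) log V) time. The names `most_popular_paths` and `path_to` are hypothetical.

```python
import heapq
import math


def most_popular_paths(graph, source):
    """Dijkstra over -log(p) edge costs (illustrative assumption, not the paper's method).

    graph: dict mapping node -> dict mapping neighbor -> traversal probability,
           with 0 < p <= 1 (assumed to be estimated from clickstream counts).
    Returns (cost, parent): cost[v] is the summed -log probability of the most
    popular path from source to v; parent[v] is v's predecessor on that path.
    """
    cost = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    done = set()
    while heap:
        c, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry; u was already finalized
        done.add(u)
        for v, p in graph.get(u, {}).items():
            if p <= 0:
                continue  # an edge that was never traversed contributes no path
            new_cost = c - math.log(p)  # -log(p) >= 0 because p <= 1
            if new_cost < cost.get(v, math.inf):
                cost[v] = new_cost
                parent[v] = u
                heapq.heappush(heap, (new_cost, v))
    return cost, parent


def path_to(parent, target):
    """Rebuild the source-to-target path from the parent map, or None if unreachable."""
    if target not in parent:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

For a toy catalog such as `{"home": {"A": 0.6, "B": 0.4}, "A": {"C": 0.5}, "B": {"C": 0.9}}`, the path home, B, C has popularity 0.4 × 0.9 = 0.36, which beats home, A, C at 0.6 × 0.5 = 0.30, even though the first hop to A is individually more popular. The popularity of a returned path can be recovered as `math.exp(-cost[target])`.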