Finding hot query patterns over an XQuery stream

Abstract.Caching query results is one efficient approach to improving the performance of XML management systems. This entails the discovery of frequent XML queries issued by users. In this paper, we model user queries as a stream of XML query pattern trees and mine the frequent query patterns over the query stream. To facilitate the one-pass mining process, we devise a novel data structure called DTS to summarize the pattern trees seen so far. By grouping the incoming pattern trees into batches, we can dynamically mark the active portion of the current batch in DTS and limit the enumeration of candidate trees to only the currently active pattern trees. We also design another summary data structure called ECTree that provides for the incremental computation of the frequent tree patterns over the query stream. Based on the above two constructs, we present two mining algorithms called XQSMinerI and XQSMinerII. XQSMinerI is fast, but it tends to overestimate, while XQSMinerII adopts a filter-and-refine approach to minimize the amount of overestimation. Experimental results show that the proposed methods are both efficient and scalable and require only small memory footprints.

[1]  Thomas Schwentick,et al.  XPath Containment in the Presence of Disjunction, DTDs, and Variables , 2003, ICDT.

[2]  Yossi Matias,et al.  DIMACS Series in Discrete Mathematicsand Theoretical Computer Science Synopsis Data Structures for Massive Data , 2007 .

[3]  Hannu T. T. Toivonen,et al.  Samplinglarge databases for finding association rules , 1996, VLDB 1996.

[4]  Rajeev Motwani,et al.  Chain: operator scheduling for memory minimization in data stream systems , 2003, SIGMOD '03.

[5]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[6]  Michael Stonebraker,et al.  Load Shedding in a Data Stream Manager , 2003, VLDB.

[7]  Michael Stonebraker,et al.  Operator Scheduling in a Data Stream Manager , 2003, VLDB.

[8]  Mong-Li Lee,et al.  Efficient Mining of XML Query Patterns for Caching , 2003, VLDB.

[9]  F. Luccio,et al.  Exact Rooted Subtree Matching in Sublinear Time , 2001 .

[10]  Peter T. Wood,et al.  Containment for XPath Fragments under DTD Constraints , 2003, ICDT.

[11]  Piotr Indyk,et al.  Maintaining stream statistics over sliding windows: (extended abstract) , 2002, SODA '02.

[12]  Christos Faloutsos,et al.  Adaptive, Hands-Off Stream Mining , 2003, VLDB.

[13]  Alexandre Termier,et al.  TreeFinder: a first step towards XML data mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[14]  David J. DeWitt,et al.  The Niagara Internet Query System , 2001, IEEE Data Eng. Bull..

[15]  Ke Wang,et al.  Discovering Structural Association of Semistructured Data , 2000, IEEE Trans. Knowl. Data Eng..

[16]  Samuel Madden,et al.  Continuously adaptive continuous queries over streams , 2002, SIGMOD '02.

[17]  Hiroki Arimura,et al.  Online algorithms for mining semi-structured data stream , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[18]  Dennis Shasha,et al.  Efficient elastic burst detection in data streams , 2003, KDD '03.

[19]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[20]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[21]  Ramakrishnan Srikant,et al.  Mining quantitative association rules in large relational tables , 1996, SIGMOD '96.

[22]  Elke A. Rundensteiner,et al.  XCache: a semantic caching system for XML queries , 2002, SIGMOD '02.

[23]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[24]  Dennis Shasha,et al.  Algorithmics and applications of tree and graph searching , 2002, PODS.

[25]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[26]  Piotr Indyk,et al.  Maintaining Stream Statistics over Sliding Windows , 2002, SIAM J. Comput..

[27]  Rajeev Motwani,et al.  Sampling from a moving window over streaming data , 2002, SODA '02.

[28]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[29]  Rajeev Motwani,et al.  Approximate Frequency Counts over Data Streams , 2012, VLDB.

[30]  Moses Charikar,et al.  Finding frequent items in data streams , 2004, Theor. Comput. Sci..

[31]  Dan Suciu,et al.  Containment and equivalence for an XPath fragment , 2002, PODS.

[32]  Steven J. DeRose,et al.  XML Path Language (XPath) , 1999 .

[33]  Mong-Li Lee,et al.  Mining frequent query patterns from XML queries , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[34]  Rajeev Rastogi,et al.  Processing complex aggregate queries over data streams , 2002, SIGMOD '02.

[35]  Thomas Schwentick,et al.  XPath query containment , 2004, SGMD.

[36]  Christian Hidber,et al.  Association Rule Mining , 2017 .

[37]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[38]  Mohammed J. Zaki Efficiently mining frequent trees in a forest , 2002, KDD.

[39]  Jennifer Widom,et al.  Characterizing memory requirements for queries over continuous data streams , 2004, ACM Trans. Database Syst..

[40]  Rajeev Rastogi,et al.  Processing set expressions over continuous update streams , 2003, SIGMOD '03.

[41]  Hiroki Arimura,et al.  Optimized Substructure Discovery for Semi-structured Data , 2002, PKDD.

[42]  Yossi Matias,et al.  New sampling-based summary statistics for improving approximate query answers , 1998, SIGMOD '98.

[43]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[44]  Qiang Chen,et al.  Aurora : a new model and architecture for data stream management ) , 2006 .

[45]  Abhinandan Das,et al.  Approximate join processing over data streams , 2003, SIGMOD '03.

[46]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[47]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[48]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.

[49]  Hiroki Arimura,et al.  Efficient Substructure Discovery from Large Semi-Structured Data , 2001, IEICE Trans. Inf. Syst..

[50]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.