Incremental sequence-based frequent query pattern mining from XML queries

Existing algorithms of mining frequent XML query patterns (XQPs) employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods compute the frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XQPs transferred from user query logs. However, it is not straightforward to maintain such discovered frequent patterns in real XML databases as there may be frequent updates that may not only invalidate some existing frequent query patterns but also generate some new frequent query patterns. Therefore, a drawback of existing methods is that they are rather inefficient for the evolution of transaction databases. To address above-mentioned problems, this paper proposes an efficient algorithm ESPRIT to mine frequent XQPs without costly tree-containment checking. ESPRIT transforms XML queries into sequences using a one-to-one mapping technique and mines the frequent sequences to generate frequent XQPs. We propose two efficient incremental algorithms, ESPRIT-i and ESPRIT-i+, to incrementally mine frequent XQPs. We devise several novel optimization techniques of query rewriting, cache lookup, and cache replacement to improve the answerability and the hit rate of caching. We have implemented our algorithms and conducted a set of experimental studies on various datasets. The experimental results demonstrate that our algorithms achieve high efficiency and scalability and outperform state-of-the-art methods significantly.

[1]  Yong Zhang,et al.  SCEND: An Efficient Semantic Cache to Adequately Explore Answerability of Views , 2006, WISE.

[2]  Kyuseok Shim,et al.  APEX: an adaptive path index for XML data , 2002, SIGMOD '02.

[3]  Umeshwar Dayal,et al.  FreeSpan: frequent pattern-projected sequential pattern mining , 2000, KDD '00.

[4]  Wanhong Xu,et al.  The Framework of an XML Semantic Caching System , 2005, WebDB.

[5]  Mong-Li Lee,et al.  Efficient Mining of XML Query Patterns for Caching , 2003, VLDB.

[6]  Yong Zhang,et al.  Incremental Mining of Frequent Query Patterns from XML Queries for Caching , 2006, Sixth International Conference on Data Mining (ICDM'06).

[7]  Mong-Li Lee,et al.  Mining frequent query patterns from XML queries , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[8]  Yonatan Aumann,et al.  Borders: An Efficient Algorithm for Association Generation in Dynamic Databases , 1999, Journal of Intelligent Information Systems.

[9]  Alexandre Termier,et al.  TreeFinder: a first step towards XML data mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[10]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[11]  Hannu Toivonen,et al.  Finding Frequent Substructures in Chemical Compounds , 1998, KDD.

[12]  Hamid Pirahesh,et al.  A Framework for Using Materialized XPath Views in XML Query Processing , 2004, VLDB.

[13]  Sushil Jajodia,et al.  Mining Temporal Relationships with Multiple Granularities in Time Sequences , 1998, IEEE Data Eng. Bull..

[14]  Jiawei Han,et al.  Efficient mining of partial periodic patterns in time series database , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[15]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[16]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[17]  Charu C. Aggarwal,et al.  Xproj: a framework for projected structural clustering of xml documents , 2007, KDD '07.

[18]  Florent Masseglia,et al.  The PSP Approach for Mining Sequential Patterns , 1998, PKDD.

[19]  Hiroki Arimura,et al.  Efficient Substructure Discovery from Large Semi-Structured Data , 2001, IEICE Trans. Inf. Syst..

[20]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.

[21]  Mohammed J. Zaki Efficiently mining frequent trees in a forest: algorithms and applications , 2005, IEEE Transactions on Knowledge and Data Engineering.

[22]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[23]  Elke A. Rundensteiner,et al.  XCache: a semantic caching system for XML queries , 2002, SIGMOD '02.

[24]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.

[25]  Bongki Moon,et al.  PRIX: indexing and querying XML using prufer sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[26]  Dan Suciu,et al.  Index Structures for Path Expressions , 1999, ICDT.

[27]  Andrew Lim,et al.  D(k)-index: an adaptive structural summary for graph-structured data , 2003, SIGMOD '03.

[28]  Jianyong Wang,et al.  Exploit sequencing to accelerate hot XML query pattern mining , 2006, SAC.

[29]  Mohammed J. Zaki Efficiently mining frequent trees in a forest , 2002, KDD.

[30]  Dan Suciu,et al.  Query Caching and View Selection for XML Databases , 2005, VLDB.

[31]  Hiroki Arimura,et al.  Optimized Substructure Discovery for Semi-structured Data , 2002, PKDD.

[32]  Joonho Kwon,et al.  FiST: Scalable XML Document Filtering by Sequencing Twig Patterns , 2005, VLDB.

[33]  Mong-Li Lee,et al.  2PXMiner: an efficient two pass mining of frequent XML query patterns , 2004, KDD.

[34]  Vagelis Hristidis,et al.  Semantic Caching of XML Databases , 2002, WebDB.

[35]  Jeffrey F. Naughton,et al.  Middle-tier database caching for e-business , 2002, SIGMOD '02.

[36]  Yi Chen,et al.  Incremental mining of frequent XML query patterns , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[37]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[38]  Philip S. Yu,et al.  ViST: a dynamic index method for querying XML data by tree structures , 2003, SIGMOD '03.

[39]  Ehud Gudes,et al.  Exploiting local similarity for indexing paths in graph-structured data , 2002, Proceedings 18th International Conference on Data Engineering.

[40]  Philip S. Yu,et al.  Mining long sequential patterns in a noisy environment , 2002, SIGMOD '02.

[41]  Ke Wang,et al.  Discovering Structural Association of Semistructured Data , 2000, IEEE Trans. Knowl. Data Eng..

[42]  Yong Zhang,et al.  Exploit sequencing views in semantic cache to accelerate xpath query evaluation , 2007, WWW '07.

[43]  Sridhar Ramaswamy,et al.  Cyclic association rules , 1998, Proceedings 14th International Conference on Data Engineering.