On-Line Selectivity Estimation for XML Path Expressions using Markov Histograms

The extensible mark-up language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions in order to optimize query execution plans. Selectivity estimation of XML path expressions is usually based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics. In this paper, we propose XPathLearner, a method for estimating the selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the required statistics using query feedback in an on-line manner, and is especially suited to queries in Internet-scale applications, where the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and can thus be more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using the XMark synthetic data set and several real data sets.
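
To make the idea of feedback-driven statistics concrete, the following is a minimal sketch of a first-order Markov table over tag paths that is refined only from (path, observed count) pairs. It is an illustration under stated assumptions, not the XPathLearner algorithm itself: the class name `MarkovHistogram`, the `default_count` prior, and the multiplicative update rule for longer paths are all hypothetical choices made for this example. Paths are plain child-axis paths such as `/a/b/c`, and a single-tag path stands for the count of that tag.

```python
class MarkovHistogram:
    """Illustrative first-order Markov table learned from query feedback."""

    def __init__(self, default_count=1.0):
        # Assumed prior used for any tag or tag pair not yet seen in feedback.
        self.default_count = default_count
        self.single = {}  # tag -> estimated count of that tag
        self.pair = {}    # (parent, child) -> estimated count of parent/child

    @staticmethod
    def _tags(path):
        return path.strip("/").split("/")

    def estimate(self, path):
        """Estimate the count of `path` under a first-order Markov assumption."""
        tags = self._tags(path)
        if len(tags) == 1:
            return self.single.get(tags[0], self.default_count)
        est = self.pair.get((tags[0], tags[1]), self.default_count)
        # Chain rule: f(a/b/c) ~ f(a/b) * f(b/c) / f(b)
        for parent, child in zip(tags[1:], tags[2:]):
            est *= (self.pair.get((parent, child), self.default_count)
                    / self.single.get(parent, self.default_count))
        return est

    def feedback(self, path, observed, rate=1.0):
        """Refine the table from the true count returned by an executed query."""
        tags = self._tags(path)
        if len(tags) == 1:
            old = self.single.get(tags[0], self.default_count)
            self.single[tags[0]] = old + rate * (observed - old)
            return
        if len(tags) == 2:
            key = (tags[0], tags[1])
            old = self.pair.get(key, self.default_count)
            self.pair[key] = old + rate * (observed - old)
            return
        est = self.estimate(path)
        if est <= 0 or observed <= 0:
            return  # this sketch skips degenerate feedback
        pairs = list(zip(tags, tags[1:]))
        # Spread the multiplicative correction evenly over the pairs involved.
        factor = (observed / est) ** (rate / len(pairs))
        for key in set(pairs):
            self.pair[key] = self.pair.get(key, self.default_count) * factor
```

With `rate=1.0`, feeding back the true count of a path makes the next estimate for that same path match the observation, while estimates for other paths that share tag pairs shift along with it. A smaller `rate` would blend old and new evidence, which is one simple way to follow changes in the underlying data.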
