Synopses for XML

All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, we summarize the main results of our recent work on XSKETCHes, a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is NP-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions.
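
The greedy forward selection idea mentioned above can be illustrated with a minimal sketch. It is not the XSKETCH construction algorithm itself: the refinement names, the `costs` table, and the `error` callback are all hypothetical stand-ins, and the scoring rule (error reduction per unit of space) is one plausible choice assumed here for illustration. Starting from the coarsest synopsis, the loop repeatedly applies the affordable refinement with the best marginal gain until the space budget is exhausted or no refinement helps.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def greedy_forward_selection(
    costs: Dict[str, int],
    error: Callable[[FrozenSet[str]], float],
    budget: int,
) -> Tuple[List[str], int]:
    """Pick refinements greedily by error reduction per unit of space.

    costs  -- hypothetical space cost of each candidate refinement
    error  -- hypothetical estimation error of a synopsis that has the
              given set of refinements applied
    budget -- total space available for refinements
    """
    chosen: List[str] = []
    used = 0
    current_error = error(frozenset())
    while True:
        best_name, best_score, best_error = None, 0.0, current_error
        for name, cost in costs.items():
            # Skip refinements already applied or that no longer fit.
            if name in chosen or used + cost > budget:
                continue
            new_error = error(frozenset(chosen + [name]))
            score = (current_error - new_error) / cost
            if score > best_score:
                best_name, best_score, best_error = name, score, new_error
        if best_name is None:
            # Nothing affordable improves the error any further.
            break
        chosen.append(best_name)
        used += costs[best_name]
        current_error = best_error
    return chosen, used


# Toy instance (all numbers invented): three candidate refinements with
# additive error reductions and a space budget of 6 units.
costs = {"split_A": 2, "split_B": 3, "split_C": 4}
reductions = {"split_A": 4.0, "split_B": 3.0, "split_C": 10.0}


def toy_error(selected: FrozenSet[str]) -> float:
    return 20.0 - sum(reductions[name] for name in selected)


chosen, used = greedy_forward_selection(costs, toy_error, budget=6)
print(chosen, used)
```

Because the error is re-evaluated after every step, the sketch also works when the benefit of one refinement depends on which others were already applied, which is the usual reason to prefer forward selection over a one-shot ranking. As a heuristic for an NP-hard problem, it carries no optimality guarantee.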
