Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Data on the Internet is increasingly presented in XML format. This enables novel applications that pose queries over “all the XML data on the Internet.” Queries over XML data use path expressions to navigate through the structure of the data, and optimizing these queries requires estimating the selectivity of these path expressions. In this paper, we propose two techniques for estimating the selectivity of simple XML path expressions over complex large-scale XML data as would be handled by Internet-scale applications: path trees and Markov tables. Both techniques work by summarizing the structure of the XML data in a small amount of memory and using this summary for selectivity estimation. We experimentally demonstrate the accuracy of our proposed techniques, and explore the different situations that would favor one technique over the other. We also demonstrate that our proposed techniques are more accurate than the best previously known alternative.
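To make the idea of a structural summary concrete, here is a minimal Python sketch of a Markov-table-style estimator. The abstract does not spell out the construction, so the details are assumptions made for illustration: the table stores occurrence counts for single tags and parent/child tag pairs only, longer paths are estimated by chaining pair counts under a first-order Markov assumption, and a path such as `a/b/c` is matched anywhere in the document. The names `build_markov_table` and `estimate` are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_markov_table(xml_text):
    """Count every tag (length-1 path) and every parent/child tag pair (length-2 path)."""
    counts = Counter()
    stack = [ET.fromstring(xml_text)]
    while stack:
        node = stack.pop()
        counts[(node.tag,)] += 1
        for child in node:
            counts[(node.tag, child.tag)] += 1
            stack.append(child)
    return counts


def estimate(counts, path):
    """Estimate how many elements a simple path like 'a/b/c' reaches.

    Assumption (illustrative): the tag at each step depends only on its parent tag,
    so f(t1/.../tn) ~= f(t1/t2) * product of f(ti/ti+1) / f(ti) for i = 2..n-1.
    """
    tags = tuple(path.split("/"))
    if len(tags) <= 2:
        return float(counts[tags])  # stored exactly in the table
    est = float(counts[tags[:2]])
    for i in range(1, len(tags) - 1):
        denom = counts[(tags[i],)]
        if denom == 0:
            return 0.0  # an intermediate tag never occurs, so the path cannot match
        est *= counts[(tags[i], tags[i + 1])] / denom
    return est


# Small hand-made document: three <b> elements, only two of them under <a>.
doc = "<a><b><c/><c/></b><b><c/></b><d><b/></d></a>"
table = build_markov_table(doc)
abc_estimate = estimate(table, "a/b/c")
```

On this toy document the sketch estimates 2.0 matches for `a/b/c`, while the true count is 3: the `b` under `d` has no `c` children, which the pair counts cannot distinguish. This is the kind of error a compact summary trades for its small memory footprint.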
