Query Optimization for XML

XML is an emerging standard for data representation and exchange on the World-Wide Web. Due to the nature of information on the Web and the inherent flexibility of XML, we expect that much of the data encoded in XML will be semistructured: the data may be irregular or incomplete, and its structure may change rapidly or unpredictably. This paper describes the query processor of Lore, a DBMS for XML-based data that supports an expressive query language. We focus primarily on Lore's cost-based query optimizer. While all of the usual problems associated with cost-based query optimization apply to XML-based query languages, a number of additional problems arise, such as new kinds of indexing, more complicated notions of database statistics, and vastly different query execution strategies for different databases. We define appropriate logical and physical query plans, database statistics, and a cost model, and we describe plan enumeration, including heuristics for reducing the large search space. Our optimizer is fully implemented in Lore, and we report preliminary performance results. This is a short version of the paper "Query Optimization for Semistructured Data", which is available at: http://www-db.stanford.edu/~mchughj/publications/qo.ps
