A Fast Index for Semistructured Data

Queries navigate semistructured data via path expressions, and can be accelerated using an index. Our solution encodes paths as strings, and inserts those strings into a special index that is highly optimized for long and complex keys. We describe the Index Fabric, an indexing structure that provides the efficiency and flexibility we need. We discuss how "raw paths" are used to optimize ad hoc queries over semistructured data, and how "refined paths" optimize specific access paths. Although we can use knowledge about the queries and structure of the data to create refined paths, no such knowledge is needed for raw paths. A performance study shows that our techniques, when implemented on top of a commercial relational database system, outperform the more traditional approach of using the commercial system’s indexing mechanisms to query the XML.
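The idea of encoding paths as strings can be illustrated with a minimal sketch. It is not the Index Fabric itself: a sorted list with binary search stands in for the specialized index, and the key format (tag names joined by `/`, then `=` and the leaf text), the helper names `raw_path_keys` and `SortedStringIndex`, and the sample invoice document are all assumptions made for illustration.

```python
import bisect
import xml.etree.ElementTree as ET


def raw_path_keys(xml_text):
    """Return one string key per root-to-leaf path in the document.

    Assumed key format: '/tag1/tag2/...=leaf text'.
    """
    keys = []

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        children = list(elem)
        if not children:
            # Leaf element: the path plus its text value becomes one key.
            keys.append(path + "=" + (elem.text or "").strip())
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return keys


class SortedStringIndex:
    """Stand-in for a string index: a sorted list of keys (an assumption,
    not the structure described in the abstract)."""

    def __init__(self):
        self._keys = []

    def insert(self, key):
        bisect.insort(self._keys, key)

    def prefix_search(self, prefix):
        # Binary-search to the first candidate, then scan while keys match.
        i = bisect.bisect_left(self._keys, prefix)
        matches = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            matches.append(self._keys[i])
            i += 1
        return matches


# Hypothetical sample document.
doc = (
    "<invoice>"
    "<buyer><name>ABC Corp</name></buyer>"
    "<seller><name>Acme Inc</name></seller>"
    "</invoice>"
)

index = SortedStringIndex()
for key in raw_path_keys(doc):
    index.insert(key)

# A path query becomes a single string lookup on the encoded key.
print(index.prefix_search("/invoice/buyer/name="))
```

Because every root-to-leaf path is indexed without reference to a schema or query workload, this corresponds loosely to the "raw path" case; a "refined path" would add keys tailored to a specific, known access pattern.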
