Indexing and path query processing for xml data

XML has emerged as a new standard for information representation and exchange on the Internet. To efficiently process XML data, we propose the extended preorder numbering scheme, which determines the ancestor-descendant relationship between nodes in the hierarchy of XML data in constant time, and adapts to the dynamics of XML data by allocating extra space. Based on this numbering scheme, we propose sort-merge based algorithms, EA -Join and EE -Join, to process ancestor-descendant path expressions. The experimental results showed an order of magnitude performance improvement over conventional methods. We further propose the partition-based algorithms, which can be chosen by a query optimizer according to the characteristics of the input data. For complex path expressions with branches, we propose the Containment B+-tree (CB-tree) index and the IndexTwig algorithm. The CB-tree, which is an extension of the B+-tree, supports both the containment query and the reverse containment query. It is an effective indexing scheme for XML documents with or without a small number of recursions. The proposed IndexTwig algorithm works with any index supporting containment and reverse containment queries, such as the CB-tree. We also introduce a simplified output model, which outputs only the necessary result of a path expression. The output model enables the Fast Existence Test (FET) optimization to skip unnecessary data and avoid generating unwanted results. Also in this dissertation, we introduce techniques to process the predicates in XML path expressions using the EVR-tree. The EVR-tree combines the advantages of indexing on values or elements individually using B+-trees. It utilizes the high value selectivity and/or high structural selectivity, and provides ordered element access by using a priority queue. At the end of the dissertation, we introduce the XISS/R system, which is an implementation of the XML Indexing and Storage System (XISS) on top of a relational database. The XISS/R includes a web-based user interface and a XPath query engine to translate XPath queries into efficient SQL statements.

[1]  Roy Goldman,et al.  Lore: a database management system for semistructured data , 1997, SGMD.

[2]  Goetz Graefe,et al.  Sort versus Hash Revisited , 1994, IEEE Trans. Knowl. Data Eng..

[3]  David J. DeWitt,et al.  Relational Databases for Querying XML Documents: Limitations and Opportunities , 1999, VLDB.

[4]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[5]  Letizia Tanca,et al.  XML-GL: A Graphical Language for Querying and Restructuring XML Documents , 1999, SEBD.

[6]  Divesh Srivastava,et al.  Holistic twig joins: optimal XML pattern matching , 2002, SIGMOD '02.

[7]  Jeffrey F. Naughton,et al.  Covering indexes for branching path queries , 2002, SIGMOD '02.

[8]  Daniela Florescu,et al.  Quilt: An XML Query Language for Heterogeneous Data Sources , 2000, WebDB.

[9]  C. M. Sperberg-McQueen,et al.  eXtensible Markup Language (XML) 1.0 (Second Edition) , 2000 .

[10]  Daniela Florescu,et al.  A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database , 1999 .

[11]  Dan Suciu,et al.  Optimizing regular path expressions using graph schemas , 1998, Proceedings 14th International Conference on Data Engineering.

[12]  Michael J. Franklin,et al.  Efficient Filtering of XML Documents for Selective Dissemination of Information , 2000, VLDB.

[13]  Dan Suciu,et al.  SilkRoute: trading between relations and XML , 2000, Comput. Networks.

[14]  Philip S. Yu,et al.  ViST: a dynamic index method for querying XML data by tree structures , 2003, SIGMOD '03.

[15]  Jeffrey F. Naughton,et al.  Efficient Sampling Strategies for Relational Database Operations , 1993, Theor. Comput. Sci..

[16]  Torsten Grust,et al.  Accelerating XPath evaluation in any RDBMS , 2004, TODS.

[17]  Hongjun Lu,et al.  PBiTree coding and efficient processing of containment joins , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[18]  Roy Goldman,et al.  LORE: a Lightweight Object REpository for semistructured data , 1996, SIGMOD '96.

[19]  Serge Abiteboul,et al.  Regular path queries with constraints , 1997, J. Comput. Syst. Sci..

[20]  Rajeev Motwani,et al.  Random sampling for histogram construction: how much is enough? , 1998, SIGMOD '98.

[21]  Armin B. Cremers,et al.  Searching and browsing collections of structural information , 2000, Proceedings IEEE Advances in Digital Libraries 2000.

[22]  Dan Suciu,et al.  Data on the Web: From Relations to Semistructured Data and XML , 1999 .

[23]  Haim Kaplan,et al.  Compact labeling schemes for ancestor queries , 2001, SODA '01.

[24]  Hongjun Lu,et al.  Holistic Twig Joins on Indexed XML Documents , 2003, VLDB.

[25]  Christian S. Jensen,et al.  Efficient evaluation of the valid-time natural join , 1994, Proceedings of 1994 IEEE 10th International Conference on Data Engineering.

[26]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[27]  Ioana Manolescu,et al.  Integrating Keyword Search into XML Query Processing , 2000, BDA.

[28]  Denilson Barbosa,et al.  The XML web: a first study , 2003, WWW '03.

[29]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[30]  Carlo Zaniolo,et al.  Efficient Structural Joins on Indexed XML Documents , 2002, VLDB.

[31]  Hao Zhang,et al.  Path sharing and predicate evaluation for high-performance XML filtering , 2003, TODS.

[32]  Michael J. Franklin,et al.  A Fast Index for Semistructured Data , 2001, VLDB.

[33]  Beng Chin Ooi,et al.  XR-tree: indexing XML data for efficient structural joins , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[34]  Jennifer Widom,et al.  Query Optimization for XML , 1999, VLDB.

[35]  Toshiyuki Amagasa,et al.  XRel: a path-based approach to storage and retrieval of XML documents using relational databases , 2001, ACM Trans. Internet Techn..

[36]  Jignesh M. Patel,et al.  Structural joins: a primitive for efficient XML query pattern matching , 2002, Proceedings 18th International Conference on Data Engineering.

[37]  Edith Cohen,et al.  Labeling dynamic XML trees , 2002, SIAM J. Comput..

[38]  Jennifer Widom,et al.  Indexing Semistructured Data , 1998 .

[39]  Vishu Krishnamurthy,et al.  Performance Challenges in Object-Relational DBMSs , 1999, IEEE Data Eng. Bull..

[40]  Hongjun Lu,et al.  Efficient Processing of Twig Queries with OR-Predicates. , 2004, ACM SIGMOD Conference.

[41]  Ioana Manolescu,et al.  XMark: A Benchmark for XML Data Management , 2002, VLDB.

[42]  Steven J. DeRose,et al.  XML Path Language (XPath) Version 1.0 , 1999 .

[43]  Juliana Freire,et al.  From XML schema to relations: a cost-based approach to XML storage , 2002, Proceedings 18th International Conference on Data Engineering.

[44]  Alin Deutsch,et al.  Storing semistructured data with STORED , 1999, SIGMOD '99.

[45]  David J. DeWitt,et al.  On supporting containment queries in relational database management systems , 2001, SIGMOD '01.

[46]  Haim Kaplan,et al.  A comparison of labeling schemes for ancestor queries , 2002, SODA '02.

[47]  Paul F. Dietz Maintaining order in a linked list , 1982, STOC '82.