KeyX: selective key-oriented indexing in native XML-databases

In the world of Relational Database Management Systems (RDBMS), indexes are used to accelerate specific queries. The selection of indexes is an important task in database tuning; it is performed by a database administrator or by an index selection tool that suggests a set of suitable indexes. In this paper we transfer the concept of specific indexes to XML Database Management Systems (XDBMS) and present an implementation that uses occurring queries to optimize the performance of an XML database system by automatically creating suitable indexes. We introduce an index approach, called key-oriented XML index, that uses specific XML element values and attribute values as keys referencing arbitrary nodes in the data. We transfer the well-known Index Selection Problem (ISP) to XDBMS. To solve the ISP, a workload of database operations is analyzed and a set of specific indexes that minimizes the total execution time is suggested. Because the ISP is an NP-complete problem, we apply heuristics to find a solution with reduced complexity. Experimental results of the prototypical implementation of the key-oriented XML indexes on top of a native XDBMS demonstrate that our approach significantly improves the query execution time with only moderate additional storage requirements. Because the workload is analyzed periodically and suitable indexes are created or dropped automatically by solving the ISP, our approach guarantees high performance over the total lifetime of a database.
Published: in [41]

Title: A selective key-oriented XML Index for the Index Selection Problem in XDBMS
Authors: B. C. Hammerschmidt, M. Kempa and V. Linnemann
Abstract: In relational database management systems, indexes are used to accelerate specific queries. The selection of indexes is an important task when tuning a database; it is performed by a database administrator or by an index propagation tool that suggests a set of suitable indexes. In this paper we introduce a new index approach, called key-oriented XML index (KeyX), that uses specific XML element or attribute values as keys referencing arbitrary nodes in the XML data. KeyX is selective to specific queries, avoiding effort spent on elements that are never queried. This concept reduces memory consumption and unproductive index updates. We transfer the Index Selection Problem (ISP) to XDBMS. Applying the ISP, a workload of database operations is analyzed and a set of selective indexes that minimizes the total execution time for the workload is suggested. Because the workload is analyzed periodically and suitable indexes are created or dropped automatically, our implementation of KeyX guarantees high performance over the total lifetime of a database.
Published: in [44]

10.4. LIST OF PUBLICATIONS

Title: Comparisons and Performance Measurements of XML Index Structures
Authors: B. C. Hammerschmidt, M. Kempa and V. Linnemann
Abstract: Indexes are used to accelerate queries in database management systems (DBMS). In relational DBMS, indexes are broadly explored, whereas indexes in XML DBMS are still an active field of research. A multitude of approaches with different characteristics were introduced in the past. Approaches that are not selective to specific queries require the whole XML data to be indexed and may lead to enormous space consumption and poor performance if changes to the XML data occur often. With KeyX we have introduced a selective and key-oriented approach for indexing only relevant parts of XML data in a database. This work provides qualitative comparisons and performance measurements of recent approaches in XML indexing. We motivate why key-oriented indexing, which is derived from the relational world, performs as well in the XML context.
Published: in [42]

Title: Autonomous Index Optimization in XML Databases
Authors: B. C. Hammerschmidt, M. Kempa and V. Linnemann
Abstract: Defining suitable indexes is a major task when optimizing a database. Usually, a human database administrator defines a set of indexes in the design phase of the database. This can be done manually or with the help of so-called index wizard tools that analyze predefined database operations. Even with an optimal initial set of indexes when setting up a database, there is no guarantee that these indexes will suit future demands. Rather, it is realistic that the typical usage of the database will change after a while, for instance because new queries appear. In consequence, the existing indexes become suboptimal. The typical way to handle this problem is that a database administrator maintains the database permanently. In XML database management systems (XDBMS) this problem becomes even worse: because XML queries cover both content and structure, the number of possible queries and indexes is significantly higher. Additionally, for XML data without schema information, queries and indexes cannot be defined in advance, because the structure and the content of the data are not restricted. Both facts tend to result in higher maintenance costs for XML indexes compared to relational indexes. In this paper we show by performance measurements that an adaptive XDBMS that analyzes its workload periodically and creates or drops XML indexes automatically guarantees high performance over the total lifetime of a database. Although we present our index system called KeyX, the idea and the results are transferable to other XML indexing approaches.
Published: in [45]

Title: The Index Update Problem for XML Data in XDBMS
Authors: B. C. Hammerschmidt, M. Kempa and V. Linnemann
Abstract: Database Management Systems are a major component of almost every information system. In relational Database Management Systems (RDBMS), indexes are well known and essential for the efficient execution of frequent queries. For XML Database Management Systems (XDBMS) no index standards are established yet, although they are needed no less. An inevitable side effect of any index is that modifications of the indexed data have to be reflected by the index structure itself. This leads to two problems: first, it has to be determined whether a modifying operation affects an index or not. Second, if an index is affected, the index has to be updated efficiently, ideally without rebuilding the whole index. In recent years many approaches were introduced for indexing XML data in an XDBMS. All of these approaches are more or less deficient in handling updates. In this paper we give an algorithm that is based on finite automaton theory and determines whether an XPath-based database operation affects an index that is defined universally upon keys, qualifiers and a return value of an XPath expression. In addition, we give algorithms for updating our KeyX indexes efficiently if they are affected by a modification. The Index Update Problem is relevant for all applications that use a secondary XML data representation (e.g. indexes, caches, XML replication/synchronization services) where updates must be identified and realized.
Published: in [47]

Title: XDLT: A Distance Learning Tool for consistent teaching of XML and related Technologies
Authors: B. C. Hammerschmidt, P. Stursberg, J. Jungclaus and V. Linnemann
Abstract: The Extensible Markup Language (XML) has become an important data format in the e-learning world during the past years. A multitude of e-learning systems take advantage of XML for various purposes: to represent knowledge or content, for information exchange between distributed applications, or just for platform-independent storage of data. Although XML reflects a technical issue of data representation and application architecture in most cases, an emerging need for students and teachers to learn XML and XML-related technologies can be observed. For instance, a person who describes entities of a given domain with an XML-based ontology needs domain-specific knowledge and a certain degree of XML skills to express the knowledge. Current approaches to learning XML, such as tutorials and XML editors, lack guidance, monitoring of the learning process, and interoperability of different XML-related technologies such as XML data modeling (DTD), XML transformation, and query as well as update languages (XPath, XUpdate). With this paper we introduce a web-based distance teaching and learning system that teaches fundamentals of XML and major XML-related technologies. In contrast to interactive tutorials that operate mostly with fixed XML examples and XML editors that offer no guidance for the learner, our approach enables a student to learn XML and related technologies based on custom data and exercises that can be defined and monitored by a teacher.
Published: in [48]

Title: On the Intersection of XPath Expressions
Authors: B. C. Hammerschmidt, M. Kempa and V. Linnemann
Abstract: XPath is a common language for selecting nodes in an XML document. XPath uses so-called path expressions which describe a navigation path through semistructured data. In recent years some of the characteristics of XPath have been discussed. Examples include the containment of two XPath expressions p and p′ (p ⊆ p′). To the best of our knowledge the intersection of two XPath expressions (p ∩ p′) has not been treated yet. The intersection of p and p′ is the set that contains all XML nodes that are selected both by p and by p′. In the context of indexes in XML databases, the emptiness of the intersection of p and p′ is a major issue when updating the index. In order to keep the index consistent with the indexed data, it has to be detected whether an index that is defined upon p is affected by a modifying database operation with the path expression p′. In this paper we introduce the intersection problem for XPath and give a motivation for its relevance. We present an efficient intersection algorithm for XPath expressions without the NOT operator that is based on finite automata. For expressions that contain the NOT operator the intersection problem becomes NP-complete
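
The key-oriented idea described in the abstracts above (element or attribute values used as keys that reference arbitrary nodes) can be illustrated with a minimal sketch. This is not the KeyX implementation: the function names `build_key_index` and `key_values`, the three-path signature (context, key, return), and the sample document are all assumptions made for illustration, and the sketch relies only on Python's standard `xml.etree.ElementTree` module with its limited XPath subset.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data, not taken from the source.
DOC = """<library>
  <book id="b1"><author>Knuth</author><title>TAOCP</title></book>
  <book id="b2"><author>Aho</author><title>Compilers</title></book>
  <book id="b3"><author>Knuth</author><title>Concrete Mathematics</title></book>
</library>"""


def key_values(ctx, key_path):
    """Return the key values found below a context node.

    A key_path starting with '@' names an attribute of the context node;
    any other key_path is a relative element path whose text is the key.
    """
    if key_path.startswith("@"):
        value = ctx.get(key_path[1:])
        return [] if value is None else [value]
    return [el.text.strip() for el in ctx.findall(key_path) if el.text]


def build_key_index(root, context_path, key_path, return_path="."):
    """Map each key value to the nodes it references.

    Only nodes matched by context_path are visited, so elements that are
    never queried cost neither index space nor index maintenance.
    """
    index = defaultdict(list)
    for ctx in root.findall(context_path):
        # '.' means the context node itself is the referenced node.
        targets = [ctx] if return_path == "." else ctx.findall(return_path)
        for key in key_values(ctx, key_path):
            index[key].extend(targets)
    return dict(index)


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    by_author = build_key_index(root, ".//book", "author", "title")
    print([t.text for t in by_author["Knuth"]])
```

A lookup such as `by_author["Knuth"]` then replaces a scan of the whole document for a query like `//book[author="Knuth"]/title`. The sketch builds the index once and does not address updates, which the abstracts treat as a separate problem.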

[1]  Wolfgang Meier,et al.  eXist: An Open Source Native XML Database , 2002, Web, Web-Services, and Database Systems.

[2]  Jeffrey F. Naughton,et al.  Covering indexes for branching path queries , 2002, SIGMOD '02.

[3]  David J. DeWitt,et al.  Mixed Mode XML Query Processing , 2003, VLDB.

[4]  Thomas Schwentick,et al.  XPath Containment in the Presence of Disjunction, DTDs, and Variables , 2003, ICDT.

[5]  Alin Deutsch,et al.  Containment and Integrity Constraints for XPath Fragments , 2001 .

[6]  Stephen A. Cook,et al.  The complexity of theorem-proving procedures , 1971, STOC.

[7]  Surajit Chaudhuri,et al.  Microsoft Index Tuning Wizard for SQL Server 7.0 , 1998, SIGMOD '98.

[8]  Jeffrey F. Naughton,et al.  On the integration of structure indexes and inverted lists , 2004, Proceedings. 20th International Conference on Data Engineering.

[9]  Jack P. Gelb System-Managed Storage , 1989, IBM Syst. J..

[10]  Benjamin C. Pierce,et al.  XDuce: A statically typed XML processing language , 2003, TOIT.

[11]  Dan Suciu,et al.  Containment and equivalence for a fragment of XPath , 2004, JACM.

[12]  Aske Simon Christensen,et al.  Extending Java for high-level Web service construction , 2002, TOPL.

[13]  Hector J. Levesque,et al.  Hard and Easy Distributions of SAT Problems , 1992, AAAI.

[14]  Masatoshi Yoshikawa,et al.  An XML indexing structure with relative region coordinate , 2001, Proceedings 17th International Conference on Data Engineering.

[15]  Christian Kirkegaard,et al.  Static analysis of XML transformations in Java , 2003, IEEE Transactions on Software Engineering.

[16]  Jeffrey F. Naughton,et al.  XML-SQL Query Translation Literature: The State of the Art and Open Problems , 2003, Xsym.

[17]  Nils Klarlund,et al.  DSD: A schema language for XML , 2000, FMSP '00.

[18]  not Cwi,et al.  XHTML™ 1.0 The Extensible HyperText Markup Language , 2002 .

[19]  Murali Mani,et al.  Taxonomy of XML schema languages using formal language theory , 2005, TOIT.

[20]  Surajit Chaudhuri,et al.  An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server , 1997, VLDB.

[21]  Robert Richards,et al.  Universal Description, Discovery, and Integration (UDDI) , 2006 .

[22]  Rudolf Bayer,et al.  Multidimensional Mapping and Indexing of XML , 2003, BTW.

[23]  Philip S. Yu,et al.  ViST: a dynamic index method for querying XML data by tree structures , 2003, SIGMOD '03.

[24]  Surajit Chaudhuri,et al.  An overview of query optimization in relational systems , 1998, PODS.

[25]  L. Khachiyan Polynomial algorithms in linear programming , 1980 .

[26]  Peter Murray-Rust Chemical Markup Language: A Simple Introduction to Structured Documents , 1997, World Wide Web J..

[27]  Cong Yu,et al.  TIMBER: A native XML database , 2002, The VLDB Journal.

[28]  Hilary Putnam,et al.  A Computing Procedure for Quantification Theory , 1960, JACM.

[29]  Torsten Grust,et al.  Accelerating XPath location steps , 2002, SIGMOD '02.

[30]  François Bry,et al.  Content and structure in indexing and ranking XML , 2004, WebDB '04.

[31]  Maarten Marx,et al.  XPath with Conditional Axis Relations , 2004, EDBT.

[32]  Nabil Layaïda,et al.  Containment of XPath expressions: an inference and rewriting based approach , 2003, Extreme Markup Languages®.

[33]  Dan Suciu,et al.  Data on the Web: From Relations to Semistructured Data and XML , 1999 .

[34]  Thomas Schwentick,et al.  XPath query containment , 2004, SGMD.

[35]  D. Box,et al.  Simple object access protocol (SOAP) 1.1 , 2000 .

[36]  Elliotte Rusty Harold,et al.  XML in a Nutshell , 2001 .

[37]  Pavel Zezula,et al.  Processing XML Queries with Tree Signatures , 2003, Intelligent Search on XML Data.

[38]  Roy Goldman,et al.  DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases , 1997, VLDB.

[39]  Volker Linnemann,et al.  On the intersection of XPath expressions , 2005, 9th International Database Engineering & Application Symposium (IDEAS'05).

[40]  Rajesh Bordawekar,et al.  XJ: integration of XML processing into java , 2004, WWW Alt. '04.

[41]  Dan Suciu,et al.  Index Structures for Path Expressions , 1999, ICDT.

[42]  Volker Linnemann,et al.  On the Index Selection Problem applied to Key oriented XML Indexes , 2004 .

[43]  Alfred V. Aho,et al.  The Design and Analysis of Computer Algorithms , 1974 .

[44]  Volker Linnemann,et al.  XDLT: A Distance Learning Tool for consistent teaching of XML and related Technologies , 2005 .

[45]  Hamid Pirahesh,et al.  System RX: one part relational, one part XML , 2005, SIGMOD '05.

[46]  Volker Linnemann,et al.  The Index Update Problem for XML Data in XDBMS , 2005, ICEIS.

[47]  Ramez Elmasri,et al.  Fundamentals of Database Systems , 1989 .

[48]  Kyuseok Shim,et al.  APEX: an adaptive path index for XML data , 2002, SIGMOD '02.

[49]  Ioana Manolescu,et al.  A Benchmark for XML Data Management , 2002 .

[50]  Sven Helmer,et al.  Anatomy of a native XML base management system , 2002, The VLDB Journal.

[51]  François Bry,et al.  Symmetry in XPath , 2002 .

[52]  Anura Gurugé,et al.  Universal Description, Discovery, and Integration , 2004 .

[53]  Roy Goldman,et al.  Lore: a database management system for semistructured data , 1997, SGMD.

[54]  Kai-Uwe Sattler,et al.  QUIET: Continuous Query-driven Index Tuning , 2003, VLDB.

[55]  Sven Helmer,et al.  Natix: A Technology Overview , 2002, Web, Web-Services, and Database Systems.

[56]  Benjamin C. Pierce,et al.  Regular Object Types , 2003, ECOOP.

[57]  Matteo Fischetti,et al.  Exact and Approximate Algorithms for the Index Selection Problem in Physical Database Design , 1995, IEEE Trans. Knowl. Data Eng..

[58]  Jennifer Widom,et al.  Indexing Semistructured Data , 1998 .

[59]  Volker Linnemann,et al.  Autonomous Index Optimization in XML Databases , 2005, 21st International Conference on Data Engineering Workshops (ICDEW'05).

[60]  Raymond K. Wong,et al.  A fast and versatile path index for querying semi-structured data , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[61]  Andrew Lim,et al.  D(k)-index: an adaptive structural summary for graph-structured data , 2003, SIGMOD '03.

[62]  Gerhard Weikum,et al.  HOPI: An Efficient Connection Index for Complex XML Document Collections , 2004, EDBT.

[63]  Jeffrey D. Ullman,et al.  Introduction to Automata Theory, Languages and Computation , 1979 .

[64]  Volker Linnemann,et al.  A Selective Key-Oriented XML Index for the Index Selection Problem in XDBMS , 2004, DEXA.

[65]  Douglas Comer,et al.  The difficulty of optimum index selection , 1978, TODS.

[66]  Wassilios Kazakos,et al.  Datenbanken und XML - Konzepte, Anwendungen, Systeme , 2002, Xpert.press.

[67]  David J. DeWitt,et al.  The Niagara Internet Query System , 2001, IEEE Data Eng. Bull..

[68]  Georg Gottlob,et al.  The complexity of XPath query evaluation , 2003, PODS.

[69]  Daniel C. Zilio,et al.  DB2 advisor: an optimizer smart enough to recommend its own indexes , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[70]  François Bry,et al.  Content-Aware DataGuides for Indexing Large Collections of XML Documents , 2003 .

[71]  L. G. Khachiyan A polynomial algorithm in linear programming , 1979 .

[72]  Ioana Manolescu,et al.  XMark: A Benchmark for XML Data Management , 2002, VLDB.

[73]  Dan Suciu,et al.  Containment and equivalence for an XPath fragment , 2002, PODS.

[74]  Forouzan Golshani,et al.  Proceedings of the Eighth International Conference on Data Engineering , 1992 .

[75]  Volker Linnemann,et al.  Type Checking in XOBE , 2003, BTW.

[76]  Jozef Kratica,et al.  A Genetic Algorithm for the Index Selection Problem , 2003, EvoWorkshops.

[77]  Scott Boag,et al.  XQuery 1.0 : An XML Query Language , 2007 .

[78]  Harald Schöning Tamino - A DBMS designed for XML , 2001, ICDE.

[79]  Beda Christoph Hammerschmidt KeyX: ein selektiver schlüsselorientierter Index für das Index Selection Problem in XDBMS , 2004, Grundlagen von Datenbanken.

[80]  Xml Db Initiative XUpdate-XML Update Language , 2003 .

[81]  Bruce W. Perry Java Servlet & JSP Cookbook , 2003 .

[82]  Peter T. Wood,et al.  Containment for XPath Fragments under DTD Constraints , 2003, ICDT.

[83]  Michael J. Franklin,et al.  A Fast Index for Semistructured Data , 2001, VLDB.

[84]  Laks V. S. Lakshmanan,et al.  On Testing Satisfiability of Tree Pattern Queries , 2004, VLDB.

[85]  Beng Chin Ooi,et al.  XR-tree: indexing XML data for efficient structural joins , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[86]  Charles F. Goldfarb,et al.  SGML handbook , 1990 .

[87]  Surajit Chaudhuri,et al.  AutoAdmin “what-if” index analysis utility , 1998, SIGMOD '98.

[88]  Jan Hidders Satisfiability of XPath Expressions , 2003, DBPL.

[89]  Donald R. Morrison,et al.  PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric , 1968, J. ACM.