Performance Challenges in Object-Relational DBMSs

XML is rapidly becoming a popular data format, and large volumes of XML data can be expected to exist soon. XML data is either produced manually (like HTML documents today) or generated by a new generation of software tools for the WWW and/or electronic data interchange (EDI). The purpose of this paper is to present the results of an initial study of storing and querying XML data. As a first step, this study focused on the use of relational database systems and on very simple schemes to store and query XML data. In other words, we would like to study how the simplest and most obvious approaches perform before considering more sophisticated ones.

In general, numerous options to store and query XML data exist. In addition to a relational database, XML data can be stored in a file system, an object-oriented database (e.g., Excelon), or a special-purpose (or semi-structured) system such as Lore (Stanford), Lotus Notes, or Tamino (Software AG). It is still unclear which of these options will ultimately find widespread acceptance. A file system could be used with very little effort to store XML data, but it would not provide any support for querying that data. Object-oriented database systems would allow XML elements and sub-elements to be clustered; this feature might be useful for certain applications, but the current generation of object-oriented database systems is not mature enough to process complex queries on large databases. It is going to take even longer before special-purpose systems are mature.

Even when using an RDBMS, there are many different ways to store XML data. One strategy is to let the user or a system administrator decide how XML elements are stored in relational tables. Such an approach is supported, e.g., by Oracle 8i. Another option is to infer from the DTDs of the XML documents how the XML elements should be mapped into tables; such an approach has been studied in [4].
Yet another option is to analyze the XML data and the expected query workload; such an approach has been devised, e.g., in [2].

In this work, we study only very simple ad-hoc schemes; we think that such a study is necessary before adopting a more complex approach. The schemes that we analyze require no input from the user, work in the absence of DTDs or when DTDs are meaningless, and do not involve any analysis of the XML data. Because of their simplicity, the approaches we study will not show the best possible performance, but as we will see, some of them show very good query performance in most situations. Also, there is no guarantee that any of the more sophisticated approaches known so far will perform better than our simple schemes; see [3] for some experimental results in this respect. Furthermore, the results of our study can be used as input for more sophisticated approaches.
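
To make concrete what a scheme that needs no user input, no DTD, and no analysis of the data might look like, the following is a minimal sketch in Python using SQLite. It is an illustration only and not necessarily one of the schemes evaluated in this paper: the table name `edge`, its columns, and the function `shred` are hypothetical, and attributes and mixed content are ignored for brevity. The sketch stores one row per parent-child edge of a single XML document and answers a simple path query with a self-join.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(conn, xml_text):
    """Store each element of one XML document as a row in a generic table.

    Hypothetical layout (an assumption, not taken from the paper):
      source  id of the parent node (0 for the virtual document root)
      ordinal position of the element among its siblings, starting at 1
      name    element tag
      target  id assigned to the element itself
      value   stripped text directly inside the element, or NULL
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edge ("
        "source INTEGER, ordinal INTEGER, name TEXT, "
        "target INTEGER, value TEXT)"
    )
    next_id = 0  # last node id handed out

    def visit(elem, parent_id, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        text = (elem.text or "").strip() or None
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (parent_id, ordinal, elem.tag, node_id, text),
        )
        for pos, child in enumerate(elem, start=1):
            visit(child, node_id, pos)

    visit(ET.fromstring(xml_text), 0, 1)
    conn.commit()


conn = sqlite3.connect(":memory:")
shred(
    conn,
    "<book><title>XML</title><author>Ann</author><author>Bob</author></book>",
)

# Path query book/author: one self-join per step of the path.
authors = [
    row[0]
    for row in conn.execute(
        "SELECT a.value FROM edge b JOIN edge a ON a.source = b.target "
        "WHERE b.name = 'book' AND a.name = 'author' ORDER BY a.ordinal"
    )
]
print(authors)
```

The sketch needs nothing beyond the document itself, which is the property emphasized above. The price is that every step of a path expression becomes a join, which is one reason such simple schemes cannot be expected to give the best possible performance.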