Structural joins: a primitive for efficient XML query pattern matching

XML queries typically specify patterns of selection predicates on multiple elements that have some specified tree-structured relationships. The primitive tree-structured relationships are parent-child and ancestor-descendant, and finding all occurrences of these relationships in an XML database is a core operation for XML query processing. We develop two families of structural join algorithms for this task: tree-merge and stack-tree. The tree-merge algorithms are a natural extension of traditional merge joins and multi-predicate merge joins, while the stack-tree algorithms have no counterpart in traditional relational join processing. We present experimental results on a range of data and queries using the TIMBER native XML query engine built on top of SHORE. We show that while, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse. This behavior is explained by analytical results that demonstrate that, on sorted inputs, the stack-tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-merge algorithms do not have the same guarantee.
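The stack-based idea can be illustrated with a minimal in-memory sketch. It is not the paper's exact algorithm, and it rests on assumptions the abstract does not state: each element is encoded as a hypothetical (start, end, level) triple, where start and end are positions in a document-order numbering and level is the depth in the tree, and both input lists are sorted by start. The function name `stack_tree_join` is likewise illustrative.

```python
from typing import List, Tuple

# Assumed encoding (not given in the abstract): (start, end, level).
# Node x is an ancestor of node y iff x.start < y.start and y.end < x.end.
Node = Tuple[int, int, int]


def stack_tree_join(
    ancestors: List[Node],
    descendants: List[Node],
    parent_child: bool = False,
) -> List[Tuple[Node, Node]]:
    """Join two lists sorted by start; output is ordered by descendant."""
    out: List[Tuple[Node, Node]] = []
    stack: List[Node] = []  # chain of nested candidate ancestors
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            # Drop stack entries that ended before a starts.
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop stack entries that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        if parent_child:
            # The parent, if present, is the innermost enclosing node,
            # so only the top of the stack needs to be checked.
            if stack and stack[-1][2] == d[2] - 1:
                out.append((stack[-1], d))
        else:
            # Every node left on the stack encloses d.
            for a in stack:
                out.append((a, d))
    return out


# Hypothetical document: a1 contains a2 and d2; a2 contains d1; a3 contains d3.
a1, a2, a3 = (1, 12, 1), (2, 7, 2), (13, 16, 1)
d1, d2, d3 = (3, 4, 3), (8, 9, 2), (14, 15, 2)

anc_desc = stack_tree_join([a1, a2, a3], [d1, d2, d3])
par_child = stack_tree_join([a1, a2, a3], [d1, d2, d3], parent_child=True)
```

Each candidate ancestor is pushed and popped at most once, and the inner output loop does work only for pairs that are actually emitted, which is the intuition behind a cost linear in the sizes of the inputs plus the output. This sketch ignores I/O and buffering entirely.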
