Compact Labeling Scheme for XML Ancestor Queries

Abstract

XML documents are often viewed as trees (essentially the parse tree of the document), and queries over such documents typically test for ancestor relationships among tree nodes. Search engines process such queries using an index structure that summarizes the ancestor relations. In the index, each document item (tree node) is identified by a logical id (node label) such that, given two labels, the engine can determine the ancestor relationship between the corresponding nodes. Label length is a main factor in the index size, so reducing it, even by a constant factor, is a critical issue. In this work we consider the following problem: given a rooted XML tree T, label the nodes of T in the most compact way such that, given the labels of two nodes, one can determine in constant time, by looking at the labels only, whether one node is an ancestor of the other. Labelings currently in use are all variants of the following interval scheme: number the leaves, say, from left to right, and label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor query then amounts to an interval containment test on the labels. The maximum label length under this scheme is 2 log n, where n is the number of nodes in the tree. (All logarithms in this paper are to base 2.) The focus of this work is finding a scheme that works best in practice on real XML data. We suggest an orthogonal, prefix-based approach, in which the labeling is such that an ancestor query roughly amounts to testing whether one label is a prefix of the other. We present several new labeling schemes based on this approach and analyze their performance both theoretically and empirically.
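
To make the two kinds of ancestor test concrete, the following is a minimal sketch and not one of the schemes developed in the paper. It labels a small hypothetical tree with the interval scheme described above and with a deliberately simple prefix scheme chosen here for illustration: the i-th child (counting from 0) appends i ones followed by a zero to its parent's label. The tree, the function names, and the unary-style child code are all assumptions. The sketch also assumes every internal node has at least two children; under leaf-only numbering a node with a single child would receive the same interval as that child, and a tie-breaker (not shown) would be needed.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: node name -> ordered list of child names.
Tree = Dict[str, List[str]]


def interval_labels(tree: Tree, root: str) -> Dict[str, Tuple[int, int]]:
    """Number leaves left to right; label each node with (smallest, largest) leaf below it."""
    labels: Dict[str, Tuple[int, int]] = {}
    next_leaf = 1

    def visit(node: str) -> Tuple[int, int]:
        nonlocal next_leaf
        children = tree.get(node, [])
        if not children:
            # A leaf is its own smallest and largest leaf descendant.
            labels[node] = (next_leaf, next_leaf)
            next_leaf += 1
        else:
            spans = [visit(child) for child in children]
            labels[node] = (spans[0][0], spans[-1][1])
        return labels[node]

    visit(root)
    return labels


def interval_is_ancestor(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if interval a contains interval b and the two labels differ."""
    return a != b and a[0] <= b[0] and b[1] <= a[1]


def prefix_labels(tree: Tree, root: str) -> Dict[str, str]:
    """Illustrative prefix scheme (an assumption): child i gets parent label + '1'*i + '0'."""
    labels: Dict[str, str] = {root: ""}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(tree.get(node, [])):
            # Sibling codes 0, 10, 110, ... are prefix-free, so a label is a
            # prefix of another only along a root-to-node path.
            labels[child] = labels[node] + "1" * i + "0"
            stack.append(child)
    return labels


def prefix_is_ancestor(a: str, b: str) -> bool:
    """True if label a is a proper prefix of label b."""
    return a != b and b.startswith(a)


# Hypothetical example document tree.
tree: Tree = {
    "book": ["title", "chapter1", "chapter2"],
    "chapter1": ["sec1", "sec2"],
    "chapter2": ["sec3", "sec4"],
}

iv = interval_labels(tree, "book")
px = prefix_labels(tree, "book")

print(iv["chapter2"], iv["sec3"], interval_is_ancestor(iv["chapter2"], iv["sec3"]))
print(repr(px["chapter2"]), repr(px["sec3"]), prefix_is_ancestor(px["chapter2"], px["sec3"]))
```

Both tests look only at the two labels and never at the tree. The sketch uses Python tuples and strings for readability and does not model the bit-level encoding on which the label-length bounds depend.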
