Exploiting local similarity for indexing paths in graph-structured data

XML and other semi-structured data may have partially specified or missing schema information, motivating the use of a structural summary that can be computed automatically from the data. These summaries also serve as indices for evaluating the complex path expressions common to XML and semi-structured query languages. However, to answer all path queries accurately, summaries must encode information about long, seldom-queried paths, leading to increased size and complexity with little added value. We introduce the A(k)-indices, a family of approximate structural summaries. They are based on the concept of k-bisimilarity, in which nodes are grouped by local structure, i.e., by their incoming paths of length up to k. The parameter k thus smoothly varies the level of detail (and accuracy) of the A(k)-index. For small values of k, the size of the index is substantially reduced. Because the smaller A(k)-index is approximate, we describe techniques for efficiently extracting exact answers to regular path queries from it. Our experiments show that, for moderate values of k, path evaluation using the A(k)-index ranges from very efficient for simple queries to competitive for most complex queries, while using significantly less space than comparable structures.
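
To make the grouping by k-bisimilarity concrete, the following is a minimal sketch, not the construction algorithm from the paper. It assumes the data is a directed graph with one label per node, and it refines a label-based partition k times using the blocks of each node's parents. The function name `k_bisimulation_partition` and the toy graph are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def k_bisimulation_partition(labels, edges, k):
    """Group nodes that agree on label and on incoming structure up to depth k.

    labels: dict mapping node -> label
    edges:  iterable of (parent, child) pairs
    Returns a sorted list of sorted node groups.
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    # Round 0: nodes with the same label share a block.
    block = dict(labels)

    for _ in range(k):
        # A node's signature is its current block plus the set of blocks
        # that its parents currently belong to.
        signature = {
            node: (block[node], frozenset(block[p] for p in parents[node]))
            for node in labels
        }
        # Replace each distinct signature with a compact integer id.
        ids = {}
        block = {node: ids.setdefault(sig, len(ids))
                 for node, sig in signature.items()}

    groups = defaultdict(set)
    for node, b in block.items():
        groups[b].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy graph: two "c" nodes reached via root/a/x and root/b/x.
labels = {"root": "root", "a1": "a", "b1": "b",
          "x1": "x", "x2": "x", "c1": "c", "c2": "c"}
edges = [("root", "a1"), ("root", "b1"),
         ("a1", "x1"), ("b1", "x2"),
         ("x1", "c1"), ("x2", "c2")]

for k in range(3):
    print(k, k_bisimulation_partition(labels, edges, k))
```

In this toy graph, c1 and c2 stay in one group for k = 0 and k = 1, because their incoming paths of length one both end in an "x" node, and they separate at k = 2, when the differing "a" and "b" ancestors come into view. A smaller k therefore yields fewer, coarser groups, which is the size versus accuracy trade-off the abstract describes.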
