Distributed Structural Relaxation of XPath Queries

Due to the structural heterogeneity of XML, queries are often interpreted approximately. This is achieved by relaxing the query and ranking the results based on their relevance to the original query. Query relaxation over distributed XML repositories may incur large communication costs, since partial result lists from different sites need to be gathered and ranked to assemble the overall top-k results. To process such queries efficiently, we propose using a distributed clustered index that groups documents based on their structural similarity. The clustered index proves to be very effective in reducing the sizes of the partial lists that need to be combined. Furthermore, it can be used as the basis of a pay-as-you-go approach, where clusters of documents are accessed gradually, providing the user with progressively better results. To reduce the cost of constructing and maintaining the clustered index, we use a compact data structure that trades off accuracy for storage and communication efficiency. The index is also used for selectivity estimation, so that query relaxation is geared towards the most promising structural transformations. Our experimental results show that our approach significantly reduces the communication cost of retrieving the top-k results, while maintaining a low construction cost for the clustered index.
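To make the idea of relaxing a query and ranking results by relevance concrete, the following is a minimal single-site sketch in Python. It is not the method of the paper: it assumes that the only relaxation is turning a child axis (`/`) into a descendant axis (`//`), that a document is summarized by its root-to-leaf tag paths, and that relevance is simply the number of relaxed axes (fewer is better). All names (`parse`, `relaxations`, `matches`, `best_penalty`, `top_k`, `DOCS`) and the sample documents are hypothetical, and neither distribution nor the clustered index is modeled.

```python
from itertools import product


def parse(query):
    """Split a simple path query such as '/a/b//c' into (axis, tag) steps."""
    steps, i = [], 0
    while i < len(query):
        axis = '//' if query.startswith('//', i) else '/'
        i += len(axis)
        j = i
        while j < len(query) and query[j] != '/':
            j += 1
        steps.append((axis, query[i:j]))
        i = j
    return steps


def relaxations(steps):
    """Return (penalty, relaxed_steps) pairs, least relaxed first.

    Assumption: the penalty is the number of child axes turned into
    descendant axes; the paper's actual scoring is not reproduced here.
    """
    child_positions = [i for i, (axis, _) in enumerate(steps) if axis == '/']
    out = []
    for mask in product([False, True], repeat=len(child_positions)):
        relaxed = list(steps)
        for flip, pos in zip(mask, child_positions):
            if flip:
                relaxed[pos] = ('//', steps[pos][1])
        out.append((sum(mask), relaxed))
    out.sort(key=lambda pair: pair[0])
    return out


def matches(steps, path, si=0, pi=0):
    """True if the steps embed into the tag path and end at its last tag."""
    if si == len(steps):
        return pi == len(path)
    axis, tag = steps[si]
    if axis == '/':
        return (pi < len(path) and path[pi] == tag
                and matches(steps, path, si + 1, pi + 1))
    # Descendant axis: the tag may appear at position pi or anywhere deeper.
    return any(path[k] == tag and matches(steps, path, si + 1, k + 1)
               for k in range(pi, len(path)))


def best_penalty(query, doc_paths):
    """Smallest penalty at which some path of the document matches, else None."""
    for penalty, relaxed in relaxations(parse(query)):
        if any(matches(relaxed, p) for p in doc_paths):
            return penalty
    return None


def top_k(query, docs, k):
    """Rank documents by penalty and keep the k best (penalty, name) pairs."""
    scored = [(best_penalty(query, paths), name) for name, paths in docs.items()]
    return sorted((p, n) for p, n in scored if p is not None)[:k]


# Hypothetical documents, each summarized by its root-to-leaf tag paths.
DOCS = {
    'd1': [['book', 'title']],
    'd2': [['book', 'info', 'title']],
    'd3': [['lib', 'book', 'info', 'title']],
    'd4': [['article', 'author']],
}

if __name__ == '__main__':
    print(top_k('/book/title', DOCS, 3))
```

In this sketch, `d1` matches the original query, `d2` needs one relaxed axis, `d3` needs two, and `d4` never matches. In a distributed setting, each site would produce such a partial ranked list, and merging those lists is where the communication cost discussed above arises.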
