A distributed selectivity-driven search strategy for semi-structured data over DHT-based networks

Distributed Hash Tables (DHTs) are widely used for indexing and locating many types of resources, including semi-structured data modeled as XML documents. A common distributed strategy for processing an XML query over a DHT is to split it into a set of simple path queries and resolve each of them separately. The traffic generated by this strategy grows with the number of paths in the query. To overcome this drawback, an alternative strategy resolves only the sub-query associated with the most selective path and then submits the original query to the nodes in the result set. The first goal of this paper is to provide an analytical and experimental study of the two strategies to assess their relative merits in different scenarios. On the basis of this study, we introduce an Adaptive Path Selection (APS) search technique that resolves an XML query in a distributed way by querying either the most selective path or the whole path set, depending on the selectivity of the paths in the query. The effective use of APS requires that the querying nodes know in advance the selectivity of all the paths. Addressing this problem is the second goal of the paper, which is achieved through: (i) the definition of a space-efficient data structure, the Path Selectivity Table (PST), which, given any path, returns an estimate of its selectivity; and (ii) the definition of an efficient strategy that builds the PST in a distributed way and propagates it to all nodes in the network with logarithmic performance bounds and without redundant messages. Experimental results show that the PST accurately estimates path selectivity values, and that the traffic generated by APS using PST-estimated selectivity values is comparable to that produced by APS with knowledge of the real path selectivity values.
A DHT-based framework for indexing and locating XML data over distributed networks.
An Adaptive Path Selection (APS) search algorithm that minimizes network traffic.
A space-efficient Path Selectivity Table (PST) for path selectivity estimation.
A distributed algorithm for PST construction with logarithmic performance bounds.
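
The choice APS makes between the two strategies can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the decision rule, so the sketch assumes a simple cutoff on the estimated number of nodes matching the most selective path. The names `choose_paths`, `est_matches`, and `threshold` are hypothetical, and `est_matches` stands in for whatever estimates a PST lookup would return.

```python
from typing import Dict, List


def choose_paths(paths: List[str],
                 est_matches: Dict[str, int],
                 threshold: int) -> List[str]:
    """Return the paths to resolve over the DHT for one query.

    Assumption (not from the paper): a path is "more selective" when
    fewer nodes are estimated to match it, and the single-path strategy
    is preferred when that estimate does not exceed `threshold`.
    """
    if not paths:
        raise ValueError("query has no paths")

    # The most selective path is the one with the smallest estimate.
    most_selective = min(paths, key=lambda p: est_matches[p])

    if est_matches[most_selective] <= threshold:
        # Few matching nodes: resolve this path only, then forward the
        # original query to the nodes in its result set.
        return [most_selective]

    # Otherwise resolve every path separately.
    return list(paths)


# Hypothetical query with three simple paths and made-up estimates.
query_paths = ["/book/author", "/book/title", "/book/price"]
estimates = {"/book/author": 12, "/book/title": 480, "/book/price": 350}

selected_small = choose_paths(query_paths, estimates, threshold=50)
selected_large = choose_paths(query_paths, estimates, threshold=5)
```

With these made-up estimates, a cutoff of 50 selects only `/book/author`, while a cutoff of 5 falls back to resolving all three paths. A real rule would presumably weigh lookup cost against the size of the result set rather than use a fixed cutoff.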
