Cost-Aware Processing of Similarity Queries in Structured Overlays

Large-scale distributed data management with P2P systems requires similarity operators for queries, because we cannot assume that all users agree on exactly the same schema and value representations, and because data quality suffers from spelling errors and typos. In this paper, we present an approach for efficient processing of similarity selections and joins in a structured overlay. We show that there are several possible strategies that exploit DHT features (i.e., key organization, routing, multicasting) to different extents. The choice of the best operator implementation in a given situation (selectivity, data distribution, load) should therefore be based on cost information that allows the system to estimate the computation and communication costs of query execution plans. Hence, we present a cost model for similarity operations on structured data in a DHT and demonstrate the efficiency of our proposal with experimental results from a large-scale PlanetLab deployment.
