Continuous multi-way joins over distributed hash tables

This paper studies the problem of evaluating continuous multi-way joins on top of Distributed Hash Tables (DHTs). We present a novel algorithm, called recursive join (RJoin), that takes into account several parameters that are crucial in a distributed setting, such as network traffic, query processing load distribution, and storage load distribution. The key idea of RJoin is incremental evaluation: as relevant tuples arrive, a given multi-way join is continuously rewritten into a join with fewer join operators and reassigned to different nodes of the network. In this way, RJoin distributes the responsibility of evaluating a continuous multi-way join across many network nodes, assigning parts of the evaluation of each binary join to a different node depending on the values of the join attributes. RJoin chooses the nodes to involve dynamically, taking into account the rate of incoming tuples whose values equal the values of the join attributes. RJoin also supports sliding window joins, a crucial feature, especially for long join paths, because it provides a mechanism to reduce the query processing state and thus keep the cost of handling incoming tuples stable. In addition, RJoin can handle message delays caused by heavy network traffic. We present a detailed mathematical and experimental analysis of RJoin and study the performance tradeoffs that arise.
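The incremental-evaluation idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the RJoin algorithm as specified in the paper: the `Network` class, the key format `relation.attribute=value`, the `node_for` hash, and the choice to index a new query under its first relation are all hypothetical. Sliding windows, tuple timestamps, rate-based node selection, and message delays are omitted, and each relation is assumed to appear at most once in a query.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 8  # hypothetical network size


def node_for(key):
    """Stand-in for a DHT lookup: hash a string key to a node id."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_NODES


class Network:
    def __init__(self):
        self.waiting = defaultdict(list)  # key -> partial queries waiting there
        self.stored = defaultdict(list)   # key -> tuples indexed there
        self.load = defaultdict(int)      # node id -> messages handled
        self.results = []                 # completed join results

    def _touch(self, key):
        # Count one message for the node responsible for this key.
        self.load[node_for(key)] += 1

    def submit(self, relations, preds):
        """Index a new query under its first relation's name (an assumption)."""
        query = {"todo": list(relations), "preds": preds, "bound": {}}
        self._touch(relations[0])
        self.waiting[relations[0]].append(query)

    def publish(self, rel, tup):
        """Index a tuple by relation name and by each attribute value."""
        self._touch(rel)
        for query in list(self.waiting[rel]):
            self._rewrite(query, rel, tup)
        for attr, value in tup.items():
            key = f"{rel}.{attr}={value}"
            self._touch(key)
            self.stored[key].append(tup)
            for query in list(self.waiting[key]):
                self._rewrite(query, rel, tup)

    def _rewrite(self, query, rel, tup):
        """Bind `rel` to `tup`, producing a query with one fewer relation."""
        bound = {**query["bound"], rel: tup}
        # Discard the binding if any fully bound predicate fails.
        for (r1, a1), (r2, a2) in query["preds"]:
            if r1 in bound and r2 in bound and bound[r1][a1] != bound[r2][a2]:
                return
        todo = [r for r in query["todo"] if r != rel]
        if not todo:
            self.results.append(bound)
            return
        smaller = {"todo": todo, "preds": query["preds"], "bound": bound}
        # Find a predicate linking a bound relation to an unbound one; the
        # bound value determines the key (and node) that gets the rewrite.
        for (r1, a1), (r2, a2) in query["preds"]:
            if r1 in bound and r2 in todo:
                nxt, key = r2, f"{r2}.{a2}={bound[r1][a1]}"
            elif r2 in bound and r1 in todo:
                nxt, key = r1, f"{r1}.{a1}={bound[r2][a2]}"
            else:
                continue
            self._touch(key)
            self.waiting[key].append(smaller)
            # Match tuples that arrived earlier, then keep waiting for more.
            for earlier in list(self.stored[key]):
                self._rewrite(smaller, nxt, earlier)
            return
        raise ValueError("join graph is not connected")


# Three-way join: R.a = S.a AND S.b = T.b
net = Network()
net.submit(["R", "S", "T"], [(("R", "a"), ("S", "a")), (("S", "b"), ("T", "b"))])
net.publish("T", {"b": 7, "c": 1})
net.publish("R", {"a": 5})
net.publish("S", {"a": 5, "b": 7})
print(net.results)
```

In this sketch, each arriving tuple shrinks the query by one relation and forwards the remainder to the key derived from the bound join value, so different join values land on different simulated nodes. The `load` counter gives a rough view of how messages spread across them.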
