Scalable Distributed Aggregate Computations Through Collaboration

Computing aggregates over distributed data sets constitutes an interesting class of distributed queries. Recent advances in peer-to-peer discovery of data sources and in query processing techniques have made such queries feasible and potentially more frequent. The concurrent execution of multiple, often identical, distributed aggregate queries can place a heavy burden on the data sources. This paper identifies the scalability bottlenecks that can arise in large peer-to-peer networks from the execution of large numbers of aggregate computations and proposes a solution. In our approach, peers are assigned the role of aggregate computation maintainers, which substantially reduces the number of requests to the data sources and avoids duplicate computation by sites that submit identical aggregate queries. Moreover, we present a framework that facilitates the collaboration of peers in maintaining aggregate query results. Experimental evaluation of our design demonstrates that it achieves very good performance and scales to thousands of peers.
