Optimizing data aggregation for cluster-based Internet services

Large-scale cluster-based Internet services often host partitioned datasets to provide incremental scalability. The aggregation of results produced from multiple partitions is a fundamental building block for the delivery of these services. This paper presents the design and implementation of a programming primitive -- Data Aggregation Call (DAC) -- to exploit partition parallelism for cluster-based Internet services. A DAC request specifies a local processing operator and a global reduction operator, and it aggregates the local processing results from participating nodes through the global reduction operator. Applications may allow a DAC request to return partial aggregation results as a tradeoff between quality and availability. Our architecture design aims at improving interactive responses with sustained throughput for typical cluster environments where platform heterogeneity and software/hardware failures are common. At the cluster level, our load-adaptive reduction tree construction algorithm balances processing and aggregation load across servers while exploiting partition parallelism. Inside each node, we employ an event-driven thread pool design that prevents slow nodes from adversely affecting system throughput under highly concurrent workloads. We further devise a staged timeout scheme that eagerly prunes slow or unresponsive servers from the reduction tree to meet soft deadlines. We have used the DAC primitive to implement several applications: a search engine document retriever, a parallel protein sequence matcher, and an online parallel facial recognizer. Our experimental and simulation results validate the effectiveness of the proposed optimization techniques for reducing response time, improving throughput, and gracefully handling server unresponsiveness. We also demonstrate the ease of use of the DAC primitive and the scalability of our architecture design.
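To make the shape of a DAC request concrete, the following is a minimal single-process sketch, not the paper's implementation. It assumes partitions can be modeled as in-memory lists and nodes as threads, and it folds results flatly rather than through a load-adaptive reduction tree. The names `data_aggregation_call`, `top_k_local`, `merge_top_k`, and `deadline_s` are hypothetical. The sketch shows a local processing operator applied per partition, a global reduction operator combining the results, and a soft deadline after which a partial aggregate is returned.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from functools import reduce


def data_aggregation_call(partitions, local_op, reduce_op, deadline_s):
    """Apply local_op to every partition in parallel, then fold the results
    that arrived before the soft deadline with reduce_op.

    Returns (aggregate, partitions_included). The aggregate is None when no
    partition answered in time. This is a flat fold, not a reduction tree.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(partitions)))
    futures = [pool.submit(local_op, p) for p in partitions]
    done, _late = wait(futures, timeout=deadline_s)
    # Do not block on stragglers: they are simply left out of this request.
    pool.shutdown(wait=False)
    # Keep partition order; skip partitions whose local_op raised an error.
    results = [f.result() for f in futures
               if f in done and f.exception() is None]
    if not results:
        return None, 0
    return reduce(reduce_op, results), len(results)


def top_k_local(partition, k=2):
    """Hypothetical local operator: best k (doc_id, score) pairs of one partition."""
    return sorted(partition, key=lambda d: d[1], reverse=True)[:k]


def merge_top_k(a, b, k=2):
    """Hypothetical reduction operator: merge two top-k lists into one."""
    return sorted(a + b, key=lambda d: d[1], reverse=True)[:k]
```

A caller in the style of the document-retriever application would pass `top_k_local` and `merge_top_k` together with a deadline, then inspect the returned count to decide whether a partial answer is acceptable. That count is one simple way to expose the quality-versus-availability tradeoff the abstract describes.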
