The One Minute Manager: Lightweight On-Demand Overlays for Distributed Application Management

The emergence of large-scale distributed computing clusters such as PlanetLab and Utility Grids has fueled the development of applications ranging from content distribution to name services to large-scale prototype experiments. However, managing such applications once they are deployed in a real-world, wide-area environment remains a challenging problem. In this paper, we present MON (Management Overlay Networks), a simple, scalable and lightweight system for distributed application management. At the most basic level, MON builds short-lived, on-demand overlays that can be used to execute management commands such as status queries and control operations. To further address the coverage, reliability and performance issues of on-demand overlays, we exploit techniques such as incremental overlay construction, overlay adjustment and opportunistic DAG (directed acyclic graph) based aggregation, which greatly improve the practicality of on-demand overlays. Our extensive experiments on PlanetLab show that for a large group of more than 300 nodes, on-demand overlays can be built to (1) cover more than 95% of the nodes; (2) last for tens of minutes even without failure repairs; and (3) achieve an end-to-end response time of just a couple of seconds. Further, we demonstrate the utility of MON by showing how it can be used to query the aggregate state of a real application (Pastry) deployed in a real-world environment.
