ACMS: the Akamai configuration management system

An important trend in information technology is the use of increasingly large distributed systems to deploy increasingly complex and mission-critical applications. In order for these systems to achieve the ultimate goal of having similar ease-of-use properties as centralized systems they must allow fast, reliable, and lightweight management and synchronization of their configuration state. This goal poses numerous technical challenges in a truly Internet-scale system, including varying degrees of network connectivity, inevitable machine failures, and the need to distribute information globally in a fast and reliable fashion. In this paper we discuss the design and implementation of a configuration management system for the Akamai Network. It allows reliable yet highly asynchronous delivery of configuration information, is significantly fault-tolerant, and can scale if necessary to hundreds of thousands of servers. The system is fully functional today providing configuration management to over 15,000 servers deployed in 1200+ different networks in 60+ countries.

[1]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[2]  Kirk L. Johnson,et al.  Overcast: reliable multicasting with on overlay network , 2000, OSDI.

[3]  Sanjoy Paul,et al.  Reliable Multicast Transport Protocol (RMTP) , 1997, IEEE J. Sel. Areas Commun..

[4]  Jim Gray,et al.  Notes on Data Base Operating Systems , 1978, Advanced Course: Operating Systems.

[5]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[6]  Leslie Lamport,et al.  The Byzantine Generals Problem , 1982, TOPL.

[7]  Stephen E. Deering,et al.  Host extensions for IP multicasting , 1986, RFC.

[8]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.

[9]  Ben Y. Zhao,et al.  Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination , 2001, NOSSDAV '01.

[10]  Peter Druschel,et al.  Pastry: Scalable, distributed object location and routing for large-scale peer-to- , 2001 .

[11]  Magnus Karlsson,et al.  Taming aggressive replication in the Pangaea wide-area file system , 2002, OPSR.

[12]  Srinivasan Seshan,et al.  A case for end system multicast , 2002, IEEE J. Sel. Areas Commun..

[13]  Hui Zhang,et al.  A case for end system multicast (keynote address) , 2000, SIGMETRICS '00.

[14]  Steven McCanne,et al.  RMX: reliable multicast for heterogeneous networks , 2000, Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No.00CH37064).

[15]  Miguel Castro,et al.  SCRIBE: The Design of a Large-Scale Event Notification Infrastructure , 2001, Networked Group Communication.

[16]  Ben Y. Zhao,et al.  Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and , 2001 .

[17]  Helen J. Wang,et al.  An evaluation of scalable application-level multicast built using peer-to-peer overlays , 2003, IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428).

[18]  Marvin Theimer,et al.  Flexible update propagation for weakly consistent replication , 1997, SOSP.

[19]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM 2001.

[20]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[21]  Hector Garcia-Molina,et al.  Consistency in a partitioned network: a survey , 1985, CSUR.

[22]  Mark Handley,et al.  Application-Level Multicast Using Content-Addressable Networks , 2001, Networked Group Communication.

[23]  David K. Gifford,et al.  Weighted voting for replicated data , 1979, SOSP '79.

[24]  Akhil Kumar,et al.  Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data , 1991, IEEE Trans. Computers.

[25]  Miguel Oom Temudo de Castro,et al.  Practical Byzantine fault tolerance , 1999, OSDI '99.

[26]  Ben Y. Zhao,et al.  An Infrastructure for Fault-tolerant Wide-area Location and Routing , 2001 .

[27]  Friedemann Mattern,et al.  Virtual Time and Global States of Distributed Systems , 2002 .

[28]  Nancy A. Lynch,et al.  RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks , 2002, DISC.

[29]  Mahadev Satyanarayanan,et al.  Scalable, secure, and highly available distributed file access , 1990, Computer.

[30]  Leslie Lamport,et al.  The part-time parliament , 1998, TOCS.

[31]  David R. Karger,et al.  Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web , 1997, STOC '97.