Architecture, operation, and dependability of large-scale Internet services: three case studies

We describe the architecture and operational practices of three representative large-scale Internet services, and the causes of failure in two of them. We find convergence on a common architecture: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. Networking problems and operator error are the most significant contributors to failures in the systems we examined.

Keywords: Internet, Internet service, reliability, availability, maintainability, dependability, system architecture, service architecture
