Paper information: Architecture, operation, and dependability of large-scale Internet services: three case studies


We describe the architecture and operational practices of three representative large-scale Internet services, and the causes of failure in two of them. We find convergence on a common architecture: division of nodes into service front-ends and back-ends, multiple levels of redundancy and load-balancing, and use of custom-written software for both production services and administrative tools. Operationally, we find a thin line between service developers and operators, and a need to coordinate problem detection and repair across administrative domains. Networking problems and operator error are the most significant contributors to failures in the systems we examined.

Keywords: Internet, Internet service, reliability, availability, maintainability, dependability, system architecture, service architecture

David A. Patterson | David Oppenheimer

[1] Eric A. Brewer, et al. Lessons from Giant-Scale Services, 2001, IEEE Internet Comput.

[2] Jim Gray, et al. Why Do Computers Stop and What Can Be Done About It?, 1986, Symposium on Reliability in Distributed Software and Database Systems.

[3] Brendan Murphy, et al. Measuring system and software reliability using an automated data collection process, 1995.

[4] D. Richard Kuhn, et al. Sources of Failure in the Public Switched Telephone Network, 1997, Computer.