Why Computer Systems Fail PODS

Recent years have seen dramatic changes in the way the world stores, manages, and uses its data. Whereas the end of the previous millennium saw an emphasis on decentralized, desktop-based storage and manipulation of data, we are now seeing information technology returning in great part to a model of centralized, professionally managed computation and data delivery services, using increasingly ubiquitous networks. This is however not a return to the era of mainframes, but rather a segregation of the IT world into small, end-consumer devices on one hand, and the infrastructures to support and interconnect these devices on the other. One might argue this is a more natural division of functionality.

This model of infrastructure-provided compute services has enabled a rich landscape of technological enhancements to our daily lives, ranging from e-mail and SMS to e-banking, e-government, and e-voting. At the same time, the new usage models have encouraged outsourcing, have brought about large clusters of commodity hardware, and generally have pushed the providers to optimize for cost and peak performance, while initially ignoring traditional mainframe virtues like reliability and manageability.

In such a service-oriented world, dependable computing infrastructures become indispensable. For example, organizations rely increasingly on e-mail and instant messaging for communicating and making decisions on business-critical matters; Osterman [2001] found that loss of e-mail access (whether due to downtime, viruses, or other reasons) can result in up to a 35% decrease in worker productivity. Infrastructures are critical to the viability of many sectors of the economy; according to Patterson [2002], service outages can cost tens of thousands to millions of dollars per hour, with collateral damage ranging from loss of customers to drops in market valuation, as described by Fox and Patterson [2002].
As witnessed just a few years ago (Poulsen [2004]), the risks of failure

[1] David Weimer. Bibliography, 2018, Medical History. Supplement.

[2] George Candea, et al. Microreboot - A Technique for Cheap Recovery, 2004, OSDI.

[3] Alan P. Wood, et al. Software Reliability from the Customer View, 2003, Computer.

[4] David A. Patterson, et al. Undo for Operators: Building an Undoable E-mail Store, 2003, USENIX Annual Technical Conference, General Track.

[5] Archana Ganapathi, et al. Why Do Internet Services Fail, and What Can Be Done About It?, 2002, USENIX Symposium on Internet Technologies and Systems.

[6] David A. Patterson, et al. A Simple Way to Estimate the Cost of Downtime, 2002, LISA.

[7] G. Tassey. The economic impacts of inadequate infrastructure for software testing, 2002.

[8] Armando Fox. When Does Fast Recovery Trump High Reliability, 2002.

[9] David A. Patterson, et al. Lessons from the PSTN for Dependable Computing, 2002.

[10] Steven D. Gribble, et al. Robustness in complex systems, 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.

[11] Ben L. Di Vito, et al. Formalizing space shuttle software requirements: four case studies, 1998, TSEM.

[12] D. Richard Kuhn, et al. Sources of Failure in the Public Switched Telephone Network, 1997, Computer.

[13] Charles Fishman, et al. They write the right stuff, 1996.

[14] Alan Wood, et al. Predicting Client/Server Availability, 1995, Computer.

[15] Natarajan Shankar, et al. Formal Verification for Fault-Tolerant Architectures: Prolegomena to the Design of PVS, 1995, IEEE Trans. Software Eng.

[16] Brendan Murphy, et al. Measuring system and software reliability using an automated data collection process, 1995.

[17] Mark Sullivan, et al. Software defects and their impact on system availability-a study of field failures in operating systems, 1991, Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[18] Jim Gray, et al. A census of Tandem system availability between 1985 and 1990, 1990.

[19] Hans Jörg Wingender. Reliability Data Collection and Use in Risk and Availability Assessment, 1986.

[20] R. H. Pope. Human Performance: What Improvement from Human Reliability Assessment, 1986.

[21] Jim Gray, et al. Why Do Computers Stop and What Can Be Done About It?, 1986, Symposium on Reliability in Distributed Software and Database Systems.

[22] William B. Rouse, et al. Human Detection and Diagnosis of System Failures, 1981.