Recovery Oriented Computing: A New Research Agenda for a New Century

Summary form only given, as follows. After 15 years of successfully improving cost-performance, it's time for new challenges for the systems research community. As a result of the focus on cost-performance, the fabled five 9s of availability (99.999% uptime) look to be much easier to achieve in advertising than in computers, and the cost of managing systems can be five times the cost of the hardware. In a Post-PC Era of wireless gadgets using services on the Internet, one new challenge is building services that really are dependable and much less expensive to maintain. Traditional Fault-Tolerant Computing concentrates on tolerating hardware and operating system faults, ignoring faults caused by human operators and even by applications. Recovery Oriented Computing (ROC) aims at improving Mean Time To Recover (MTTR), both to lower the cost of management and to improve the availability of the whole system, including the people who operate it. We look to civil engineering and diplomacy to inspire principles for ROC design. This talk outlines the motivation for and proposed principles of ROC design, plus some concrete results in the area of availability benchmarking.
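
As a rough illustration of why recovery time matters for availability, the sketch below uses the standard steady-state relation availability = MTTF / (MTTF + MTTR), which the abstract itself does not state. The function names and the sample MTTF and MTTR figures are hypothetical, chosen only for illustration; they are not results from the talk.

```python
# Minimal sketch, assuming the textbook steady-state availability relation.
# All names and sample numbers here are illustrative assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Downtime budget implied by "five 9s" (99.999% uptime).
    print(f"five 9s budget: {downtime_minutes_per_year(0.99999):.2f} min/year")

    # Hypothetical system: same failure rate, two different recovery times.
    for mttr in (1.0, 0.1):
        a = availability(mttf_hours=1000.0, mttr_hours=mttr)
        print(f"MTTR={mttr:>4} h -> availability {a:.6f}, "
              f"downtime {downtime_minutes_per_year(a):.1f} min/year")
```

Under this relation, five 9s leaves a budget of roughly 5.26 minutes of downtime per year, and with MTTF held fixed, a tenfold reduction in MTTR cuts unavailability by about a factor of ten. Shortening recovery is therefore an alternative to making failures rarer.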