A recovery-oriented approach to dependable services: repairing past errors with system-wide undo

Motivated by the pressing need for increased dependability in corporate and Internet services and by the perspective that effective recovery can improve dependability as much as or more than avoiding failures, we introduce a novel recovery mechanism that gives human system operators the power of system-wide undo. System-wide undo allows operators to roll back erroneous changes to a service's state without losing end-user data or updates, to make retroactive repairs in the historical timeline of the service system, and thereby to quickly recover from catastrophic state corruption, operator error, failed upgrades, and external attacks, even when the root cause of the catastrophe is unknown. We explore system-wide undo via a framework based on the novel concept of spheres of undo: bubbles of state and time that provide scope to the state recoverable by undo and serve as a structuring tool for implementing undo on standalone services, hierarchically composed systems, and distributed interacting services. Crucially, spheres of undo allow us to define the concept of paradoxes: inconsistencies that occur when an undo process retroactively alters state that has been exposed outside of its containing sphere of undo. Managing paradoxes is the grand challenge of system-wide undo, and to tackle it we introduce a framework that automatically detects and compensates for paradoxes; our approach exploits the relaxed consistency semantics already present in existing services that interact with human end-users. We describe an implementation of our system-wide undo framework for standalone services with human end-users. We explore its applicability by assembling and evaluating a prototype undoable e-mail store service, by analyzing what would be necessary to construct an undoable online auction service, and by developing a set of guidelines to help service designers retrofit their services with undo.
We find that system-wide undo functionality imposes non-negligible but tolerable overhead in terms of both time and space. Using a novel methodology we develop to benchmark human-assisted recovery processes, we also find that undo-based recovery has a net positive effect on dependability, providing significant improvements in correctness while only slightly degrading availability.
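As a rough illustration of the ideas summarized above, the following Python sketch models a single sphere of undo as a tiny key-value store. It is not the implementation described in this work: the names `Op`, `UndoableStore`, `take_checkpoint`, and `undo` are hypothetical, and the checkpoint-plus-log design is an assumption chosen for brevity. The sketch rolls state back to a checkpoint, applies an operator-supplied repair, replays logged end-user operations so their updates are not lost, and flags a paradox whenever a replayed read returns something different from what the user originally saw.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, str]


@dataclass
class Op:
    """One logged end-user operation (hypothetical format)."""
    kind: str                    # "put" or "get"
    key: str
    value: Optional[str] = None  # payload for "put"
    seen: Optional[str] = None   # what a "get" exposed outside the sphere


def apply(state: State, op: Op) -> Optional[str]:
    """Apply one operation; return the externally visible result, if any."""
    if op.kind == "put":
        state[op.key] = op.value
        return None
    return state.get(op.key)


class UndoableStore:
    """Toy sphere of undo: the state plus the history needed to rebuild it."""

    def __init__(self) -> None:
        self.state: State = {}
        self.log: List[Op] = []
        self.checkpoint: State = {}
        self.checkpoint_index = 0

    def execute(self, op: Op) -> Optional[str]:
        result = apply(self.state, op)
        if op.kind == "get":
            op.seen = result  # remember what left the sphere
        self.log.append(op)
        return result

    def take_checkpoint(self) -> None:
        self.checkpoint = dict(self.state)
        self.checkpoint_index = len(self.log)

    def undo(self, repair: Callable[[State], None]) -> List[str]:
        """Roll back, repair, and replay; return one notice per paradox."""
        state = dict(self.checkpoint)  # roll back to the checkpoint
        repair(state)                  # retroactive repair by the operator
        paradoxes: List[str] = []
        for op in self.log[self.checkpoint_index:]:  # replay user operations
            result = apply(state, op)
            if op.kind == "get" and result != op.seen:
                # Exposed state changed retroactively: record a notice that
                # a real service could turn into a compensating action.
                paradoxes.append(
                    f"{op.key}: user saw {op.seen!r}, now {result!r}"
                )
                op.seen = result
        self.state = state
        return paradoxes


# Assumed scenario: an operator error corrupts a mailbox after a checkpoint.
store = UndoableStore()
store.execute(Op("put", "alice/inbox", "msg-1"))
store.take_checkpoint()
store.state["alice/inbox"] = "<corrupted>"        # operator error, not logged
store.execute(Op("get", "alice/inbox"))           # user sees corrupted data
store.execute(Op("put", "bob/inbox", "msg-2"))    # unrelated user update
paradoxes = store.undo(lambda s: None)            # no extra repair needed here
```

In this scenario the undo discards the operator's corruption, keeps the later update to `bob/inbox`, and reports one paradox for the read of `alice/inbox`. What to do with such a notice (for example, telling the user that a message has changed) is application-specific and is left out of the sketch.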
