Undo for Operators: Building an Undoable E-mail Store

System operators play a critical role in maintaining server dependability yet lack powerful tools to help them do so. To help address this unfulfilled need, we describe Operator Undo, a tool that provides a forgiving operations environment by allowing operators to recover from their own mistakes, from unanticipated software problems, and from intentional or accidental data corruption. Operator Undo starts by intercepting and logging user interactions with a network service before they enter the system, creating a record of user intent. During an undo cycle, all system hard state is physically rewound, allowing the operator to perform arbitrary repairs; after repairs are complete, lost user data is reintegrated into the repaired system by replaying the logged user interactions while tracking and compensating for any resulting externally visible inconsistencies. We describe the design and implementation of an application-neutral framework for Operator Undo, and detail the process by which we instantiated the framework in the form of an undo-capable e-mail store supporting SMTP mail delivery and IMAP mail retrieval. Our proof-of-concept e-mail implementation imposes only a small performance overhead, and can store days or weeks of recovery log on a single disk.
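
The rewind, repair, replay cycle described above can be illustrated with a small in-memory sketch. This is not the authors' implementation: the names `Verb`, `UndoableStore`, `checkpoint`, and `undo_cycle`, the dictionary mailbox, and the `filter` hook standing in for repairable server software are all hypothetical, chosen only to show how logging user intent lets later interactions be replayed against a repaired system and how externally visible differences might be detected.

```python
import copy
from dataclasses import dataclass


@dataclass
class Verb:
    """One logged user interaction (hypothetical record format)."""
    kind: str            # "deliver" (SMTP-like) or "fetch" (IMAP-like)
    msg_id: str
    body: str = ""
    seen: object = None  # what the user was shown when first executed


class UndoableStore:
    """Toy e-mail store; an assumption-laden sketch, not the real system."""

    def __init__(self):
        self.mailbox = {}                 # hard state: msg_id -> body
        self.log = []                     # record of user intent
        self.snapshots = {}               # log position -> copy of hard state
        self.filter = lambda body: body   # stand-in for repairable software

    def _apply(self, verb):
        if verb.kind == "deliver":
            self.mailbox[verb.msg_id] = self.filter(verb.body)
            return None
        if verb.kind == "fetch":
            return self.mailbox.get(verb.msg_id)
        raise ValueError(f"unknown verb kind: {verb.kind}")

    def execute(self, verb):
        """Intercept the interaction, run it, and log it with its visible result."""
        verb.seen = self._apply(verb)
        self.log.append(verb)
        return verb.seen

    def checkpoint(self):
        """Save a copy of hard state, keyed by the current log position."""
        pos = len(self.log)
        self.snapshots[pos] = copy.deepcopy(self.mailbox)
        return pos

    def undo_cycle(self, pos, repair):
        """Rewind to a checkpoint, let the operator repair, then replay the log."""
        self.mailbox = copy.deepcopy(self.snapshots[pos])   # rewind
        repair(self)                                        # repair
        inconsistencies = []
        for verb in self.log[pos:]:                         # replay
            result = self._apply(verb)
            if verb.kind == "fetch" and result != verb.seen:
                # The user previously saw something different; report it
                # so that application-specific compensation can be applied.
                inconsistencies.append((verb.msg_id, verb.seen, result))
        return inconsistencies


# Usage: a buggy filter blanks message text; the operator repairs it afterwards.
store = UndoableStore()
store.filter = lambda body: ""            # simulated software problem
pos = store.checkpoint()
store.execute(Verb("deliver", "m1", "hello"))
store.execute(Verb("fetch", "m1"))        # user is shown the blanked text


def fix(s):
    s.filter = lambda body: body          # the operator's repair


diffs = store.undo_cycle(pos, fix)
print(diffs)
```

In this sketch the replayed delivery restores the lost message text, and the replayed fetch is flagged because the user originally saw something different. How such an inconsistency is compensated for (for example, by notifying the user) is left open here, since that depends on the application.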
