Paper Information - Undo for Operators: Building an Undoable E-mail Store

Undo for Operators: Building an Undoable E-mail Store

System operators play a critical role in maintaining server dependability yet lack powerful tools to help them do so. To help address this unfulfilled need, we describe Operator Undo, a tool that provides a forgiving operations environment by allowing operators to recover from their own mistakes, from unanticipated software problems, and from intentional or accidental data corruption. Operator Undo starts by intercepting and logging user interactions with a network service before they enter the system, creating a record of user intent. During an undo cycle, all system hard state is physically rewound, allowing the operator to perform arbitrary repairs; after repairs are complete, lost user data is reintegrated into the repaired system by replaying the logged user interactions while tracking and compensating for any resulting externally-visible inconsistencies. We describe the design and implementation of an application-neutral framework for Operator Undo, and detail the process by which we instantiated the framework in the form of an undo-capable e-mail store supporting SMTP mail delivery and IMAP mail retrieval. Our proof-of-concept e-mail implementation imposes only a small performance overhead, and can store days or weeks of recovery log on a single disk.
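The rewind, repair, and replay cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `UndoableStore`, the `buggy_filter` defect, and the in-memory dictionary standing in for hard state are all hypothetical, chosen only to show how logged user intent lets lost work be reintegrated after a repair.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class UndoableStore:
    """Toy mail store (hypothetical): mailbox name -> list of message bodies."""
    mailboxes: dict = field(default_factory=dict)  # stands in for hard state
    log: list = field(default_factory=list)        # record of user intent
    snapshot: dict = field(default_factory=dict)   # hard state at the rewind point
    buggy_filter: bool = True                      # assumed defect to be repaired

    def checkpoint(self):
        # Capture the hard state and start a fresh log from this point.
        self.snapshot = copy.deepcopy(self.mailboxes)
        self.log = []

    def deliver(self, mailbox, body):
        # Log the request before it enters the system, then execute it.
        self.log.append((mailbox, body))
        return self._apply(mailbox, body)

    def _apply(self, mailbox, body):
        if self.buggy_filter and "invoice" in body:
            return False  # message wrongly dropped by the assumed bug
        self.mailboxes.setdefault(mailbox, []).append(body)
        return True

    def undo_cycle(self, repair):
        before = copy.deepcopy(self.mailboxes)
        # Rewind: restore the hard state to the checkpoint.
        self.mailboxes = copy.deepcopy(self.snapshot)
        # Repair: the operator changes the system arbitrarily.
        repair(self)
        # Replay: re-execute the logged user interactions.
        for mailbox, body in self.log:
            self._apply(mailbox, body)
        # Report messages users did not see before the undo cycle, so the
        # externally visible difference can be explained or compensated.
        return {
            box: [m for m in msgs if m not in before.get(box, [])]
            for box, msgs in self.mailboxes.items()
            if msgs != before.get(box, [])
        }


def fix(store):
    # The operator's repair: disable the faulty filter.
    store.buggy_filter = False


store = UndoableStore()
store.checkpoint()
store.deliver("alice", "hello")
store.deliver("alice", "invoice #7")  # dropped by the buggy filter
changes = store.undo_cycle(fix)
print(changes)
```

In this sketch the replay restores the message that the assumed bug dropped, and the returned dictionary lists what changed from the user's point of view. A real system would need durable logging and protocol-aware handling of such inconsistencies, which this example does not attempt.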

David A. Patterson | Aaron B. Brown

[1] Noah Treuhaft,et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies , 2002 .

[2] Jim Gray,et al. Why Do Computers Stop and What Can Be Done About It? , 1986, Symposium on Reliability in Distributed Software and Database Systems.

[3] Mahadev Satyanarayanan,et al. Categories and Subject Descriptors: D.4.3 [Software]: File Systems Management—Distributed , 2022 .

[4] Wolfgang Graetsch,et al. Fault tolerance under UNIX , 1989, TOCS.

[5] Mark R. Crispin. Internet Message Access Protocol - Version 4rev1 , 1996, RFC.

[6] Robert B. Miller,et al. Response time in man-computer conversational transactions , 1968, AFIPS Fall Joint Computer Conference.

[7] Miguel Castro,et al. Using abstraction to improve fault tolerance , 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.

[8] Miguel Castro,et al. BASE: using abstraction to improve fault tolerance , 2001, SOSP.

[9] Samuel T. King,et al. ReVirt: enabling intrusion analysis through virtual-machine logging and replay , 2002, OPSR.

[10] J. Shaoul. Human Error , 1973, Nature.

[11] Brendan Murphy,et al. Measuring system and software reliability using an automated data collection process , 1995 .

[12] Hamid Pirahesh,et al. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging , 1998 .

[13] W. Keith Edwards,et al. Flexible conflict detection and management in collaborative applications , 1997, UIST '97.

[14] David A. Patterson,et al. Rewind, repair, replay: three R's to dependability , 2002, EW 10.