CARP: Handling Silent Data Errors and Site Failures in an Integrated Program and Storage Replication Mechanism

This paper presents CARP, an integrated program and storage replication solution. CARP extends existing program replication systems, which do not address storage errors. It builds on a record-and-replay scheme that handles nondeterminism in program execution, and it uses recorded program state and I/O logs to detect silent data errors efficiently and to recover from them efficiently. CARP is designed to be transparent to applications, to have minimal run-time impact, and to be general enough to run on commodity machines. We implemented a CARP prototype on the Linux operating system and conducted an extensive sensitivity analysis of its overhead across application profiles and system parameters. In particular, we evaluated CARP with standard, unmodified email, database, and web server benchmarks and showed that it imposes acceptable overhead while providing sub-second program state recovery once a silent data error is detected.
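The general idea of checking reads against an I/O log can be illustrated with a minimal sketch. This is not CARP's actual mechanism, which the abstract does not specify in detail. It assumes a hypothetical in-memory block store (`LoggedStore`) that logs a SHA-256 digest for every write, and a hypothetical helper (`read_with_recovery`) that falls back to a replica when the data read back no longer matches the logged digest.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of a block's contents."""
    return hashlib.sha256(data).hexdigest()


class LoggedStore:
    """Hypothetical block store that logs a digest for every write."""

    def __init__(self):
        self.blocks = {}  # block number -> bytes (stands in for the disk)
        self.io_log = {}  # block number -> digest of the last logged write

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data
        self.io_log[block_no] = digest(data)

    def read(self, block_no: int):
        """Return (data, ok); ok is False if the data disagrees with the log."""
        data = self.blocks[block_no]
        return data, digest(data) == self.io_log[block_no]


def read_with_recovery(primary: LoggedStore, replica: LoggedStore, block_no: int):
    """Read from the primary; on a digest mismatch, repair from the replica.

    Returns (data, recovered), where recovered is True if a silent
    error was detected on the primary and repaired.
    """
    data, ok = primary.read(block_no)
    if ok:
        return data, False
    good, replica_ok = replica.read(block_no)
    if not replica_ok:
        raise IOError(f"block {block_no} is corrupt on both copies")
    primary.write(block_no, good)  # overwrite the corrupt block
    return good, True


# Usage: simulate a silent corruption by changing the stored bytes
# without updating the log, then read through the recovery path.
primary, replica = LoggedStore(), LoggedStore()
for store in (primary, replica):
    store.write(0, b"mail spool record")
primary.blocks[0] = b"mail spool rec\x00rd"  # silent corruption
data, recovered = read_with_recovery(primary, replica, 0)
```

In this sketch the log lives in memory next to the data; a real system would need the log itself to be protected and kept on a separate failure domain, which is outside the scope of this illustration.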
