Hypervisor-assisted application checkpointing in virtualized environments

Checkpointing approaches fall into two broad categories: application-transparent and application-assisted. Application-assisted approaches typically provide a more flexible and lightweight mechanism, but they require changes to the application. Although most applications run well under virtualization (e.g., Xen, which is being widely adopted), adding application-assisted checkpointing, which is used for high availability, causes performance problems. The cause is the overhead that key system calls used by the checkpointing techniques incur under virtualization. To overcome this, we introduce the notion of hypervisor-assisted application checkpointing, which requires no changes to the guest operating system. We present the design and a Xen-based implementation of our family of application checkpointing techniques. Our experiments show performance improvements of 4× to 13× in the primitives used for supporting high availability compared to purely user-level approaches.
