Taking Snapshots of Virtual Networked Environments

Capturing global, consistent snapshots of a distributed computing session or system is essential to that system's reliability, manageability, and accountability. Despite the large body of work at the application, library, and operating system levels, we identify a gap in the spectrum of distributed snapshot techniques: taking snapshots of the entire distributed runtime environment. This capability is uniquely useful in a number of application scenarios. In this paper, we realize it in the context of virtual networked environments. More specifically, by adapting and implementing a distributed snapshot algorithm, we enable the capture of causally consistent snapshots of the virtual machines in a virtual networked environment. The snapshot-taking operations require no modification to the applications or operating systems running inside the virtual environment. Preliminary evaluation results indicate that our technique incurs acceptable overhead and causes little disruption to the normal operation of the virtual environment.
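To illustrate what a causally consistent snapshot means, the following is a minimal sketch of a marker-based snapshot in the style of Chandy and Lamport, run over simulated FIFO channels. The abstract does not name the algorithm that was adapted, so this is an illustration under assumptions rather than the authors' method. The names `Node`, `Network`, and the integer `balance` that stands in for a virtual machine's state are all hypothetical.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for one virtual machine; `balance` is its state."""

    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance
        self.peers = peers
        self.recorded_state = None  # local state saved when the snapshot reaches us
        self.recording = set()      # incoming channels still being recorded
        self.channel_log = {}       # sender -> in-flight amounts captured


class Network:
    """Fully connected nodes joined by FIFO channels (an assumption of this sketch)."""

    def __init__(self, balances):
        names = list(balances)
        self.nodes = {n: Node(n, b, [p for p in names if p != n])
                      for n, b in balances.items()}
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def send(self, src, dst, amount):
        self.nodes[src].balance -= amount
        self.channels[(src, dst)].append(("transfer", amount))

    def start_snapshot(self, initiator):
        self._record(self.nodes[initiator], marker_from=None)

    def _record(self, node, marker_from):
        # Save local state, then send a marker on every outgoing channel.
        node.recorded_state = node.balance
        # The channel the first marker arrived on is recorded as empty.
        node.recording = {p for p in node.peers if p != marker_from}
        node.channel_log = {p: [] for p in node.peers}
        for p in node.peers:
            self.channels[(node.name, p)].append(("marker", None))

    def deliver(self, src, dst):
        kind, amount = self.channels[(src, dst)].popleft()
        node = self.nodes[dst]
        if kind == "marker":
            if node.recorded_state is None:
                self._record(node, marker_from=src)
            else:
                node.recording.discard(src)  # this channel's state is now final
        else:
            node.balance += amount
            if src in node.recording:
                # Sent before the sender's snapshot, received after ours:
                # the message belongs to the channel state.
                node.channel_log[src].append(amount)

    def drain(self):
        # Deliver messages round-robin until every channel is empty.
        while any(self.channels.values()):
            for (src, dst), queue in self.channels.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_complete(self):
        return all(n.recorded_state is not None and not n.recording
                   for n in self.nodes.values())

    def snapshot_total(self):
        states = sum(n.recorded_state for n in self.nodes.values())
        in_flight = sum(sum(log) for n in self.nodes.values()
                        for log in n.channel_log.values())
        return states + in_flight


net = Network({"A": 100, "B": 50})
net.send("A", "B", 30)   # in flight when the snapshot starts
net.send("B", "A", 10)   # also in flight
net.start_snapshot("A")
net.drain()
print(net.snapshot_total())  # 150, the same as the initial total
```

The recorded node states plus the recorded in-flight messages add up to the initial total, so no transfer is lost or counted twice. That conservation property is what consistency buys here. In this toy model the recording logic lives inside each node; the abstract states that the actual technique requires no changes to the guest applications or operating systems, which this sketch does not attempt to reproduce.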
