VNsnap: Taking snapshots of virtual networked environments with minimal downtime

A virtual networked environment (VNE) consists of virtual machines (VMs) connected by a virtual network. VNEs have been adopted to create “virtual infrastructures” for individual users on a shared cloud computing infrastructure. The ability to take snapshots of an entire VNE, including images of the VMs with their execution, communication, and storage states, yields a unique approach to reliability, because a snapshot can restore the operation of an entire virtual infrastructure. We present VNsnap, a system that takes distributed snapshots of VNEs. Unlike existing distributed snapshot/checkpointing solutions, VNsnap does not require any modifications to the applications, libraries, or (guest) operating systems running in the VMs. Furthermore, VNsnap incurs only seconds of downtime, as much of the snapshot operation takes place concurrently with the VNE's normal operation. We have implemented VNsnap on top of Xen. Our experiments with real-world parallel and distributed applications demonstrate VNsnap's effectiveness and efficiency.
