VNsnap: Taking Snapshots of Virtual Networked Infrastructures in the Cloud

A virtual networked infrastructure (VNI) consists of virtual machines (VMs) connected by a virtual network. Created for individual users on a shared cloud infrastructure, VNIs reflect the concept of "Infrastructure as a Service" (IaaS) as part of the emerging cloud computing paradigm. The ability to take snapshots of an entire VNI (including images of the VMs with their execution, communication, and storage states) yields a unique approach to reliability, as a VNI snapshot can be used to restore the operation of the entire virtual infrastructure. We present VNsnap, a system that takes distributed snapshots of VNIs. Unlike many existing distributed snapshot/checkpointing solutions, VNsnap does not require any modifications to the applications, libraries, or (guest) operating systems (OSs) running in the VMs. Furthermore, by performing much of the snapshot operation concurrently with the VNI's normal operation, VNsnap incurs only seconds of downtime. We have implemented VNsnap on top of Xen. Our experiments with real-world parallel and distributed applications demonstrate VNsnap's effectiveness and efficiency.
