In-network live snapshot service for recovering virtual infrastructures

Infrastructure as a Service (IaaS) has become an increasingly popular type of service for both private and public clouds. The virtual infrastructures that enable IaaS support multitenancy by multiplexing the computational resources of data centers, which substantially reduces operational costs. Because hardware and software failures occur routinely in large-scale systems, cloud providers must offer failure recovery options for the distributed services hosted on such infrastructures. In this article, we present GENI-VIOLIN, a new cloud capability that can checkpoint a stateful distributed service while incurring very low overhead. Unlike previous work, GENI-VIOLIN exploits programmable OpenFlow switches to provide checkpointing services in the network, and therefore requires minimal changes to the end-host virtualization framework. We have developed a prototype of GENI-VIOLIN on the GENI infrastructure and have demonstrated its checkpoint and restore capability across multiple GENI sites.
