Saline: Improving Best-Effort Job Management in Grids

Virtualization technologies have recently attracted considerable interest in Grid computing because they allow flexible resource management. However, the most common way to exploit grids still relies on dedicated services such as resource management systems (RMSs) to obtain resources at a particular time. To improve resource usage, most of these systems provide a best-effort mode in which the lowest-priority jobs are executed when resources are idle. This mode provides no service guarantee: the RMS may kill a job at any time when the nodes it uses become subject to a higher-priority reservation. This behaviour can waste a large amount of computation time, or at least forces users to checkpoint their jobs themselves. In this paper we present Saline, a generic and non-intrusive framework that manages best-effort jobs at grid level through the use of virtual machines (VMs). We discuss the main challenges in designing such a grid system, focusing on VM snapshot management and network configuration. Experimental results show that our proposal ensures efficient execution of best-effort jobs across the whole grid.
