Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds

Cloud providers are auctioning their excess capacity using dynamically priced virtual instances. These spot instances provide significant savings compared to on-demand or fixed price instances. The users willing to use these resources are asked to provide a maximum bid price per hour, and the cloud provider runs the instances as long as the market price is below the user's bid price. By using such resources, the users are exposed explicitly to failures, and need to adapt their applications to provide some level of fault tolerance. In this paper, we expose the effect of bidding in the case of virtual HPC clusters composed of spot instances. We describe the interesting effect of uniform versus non-uniform bidding in terms of both the failure rate and the failure model. We propose an initial attempt to deal with the problem of predicting the runtime of a parallel application under various bidding strategies and various system parameters. We describe the relationship between bidding strategies and programming models, and we build a preliminary optimization model that uses real price traces from Amazon Web Services as inputs, as well as instrumented values related to the processing and network capacities of cluster instances on the EC2 services. Our results show preliminary insights into the relationship between non-uniform bidding and application scaling strategies.

[1]  Jack J. Dongarra,et al.  FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World , 2000, PVM/MPI.

[2]  Miron Livny,et al.  Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997 .

[3]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[4]  Artur Andrzejak,et al.  Decision Model for Cloud Computing under SLA Constraints , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[5]  Harrick M. Vin,et al.  Egida: an extensible toolkit for low-overhead fault-tolerance , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[6]  Artur Andrzejak,et al.  Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud , 2010, 2010 IEEE 3rd International Conference on Cloud Computing.

[7]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Roy Friedman,et al.  Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[9]  Muli Ben-Yehuda,et al.  The Resource-as-a-Service (RaaS) Cloud , 2012, HotCloud.

[10]  Justin Y. Shi,et al.  Decoupling as a Foundation for Large Scale Parallel Computing , 2009, 2009 11th IEEE International Conference on High Performance Computing and Communications.

[11]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[12]  Asser N. Tantawi,et al.  See Spot Run: Using Spot Instances for MapReduce Workflows , 2010, HotCloud.

[13]  Abdallah Khreishah,et al.  Program Scalability Analysis for HPC Cloud: Applying Amdahl's Law to NAS Benchmarks , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[14]  Rajkumar Buyya,et al.  Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters , 2011, ICA3PP.

[15]  Andrew Lumsdaine,et al.  The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[16]  R. Buyya,et al.  Comprehensive Statistical Analysis and Modeling of Spot Instances in Public Cloud Environments , 2011 .

[17]  Abdallah Khreishah,et al.  SpotMPI: A Framework for Auction-Based HPC Computing Using Amazon Spot Instances , 2011, ICA3PP.

[18]  Muli Ben-Yehuda,et al.  Deconstructing Amazon EC2 Spot Instance Pricing , 2011, CloudCom.

[19]  Abdallah Khreishah,et al.  Resource Planning for Parallel Processing in the Cloud , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[20]  Michele Mazzucco,et al.  Achieving Performance and Availability Guarantees with Spot Instances , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[21]  Geoffrey C. Fox,et al.  Twister: a runtime for iterative MapReduce , 2010, HPDC '10.

[22]  Laxmikant V. Kalé,et al.  FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[23]  Ronald Minnich,et al.  A Network-Failure-Tolerant Message-Passing System for Terascale Clusters , 2002, ICS '02.

[24]  Rajkumar Buyya,et al.  Reliable Provisioning of Spot Instances for Compute-intensive Applications , 2011, 2012 IEEE 26th International Conference on Advanced Information Networking and Applications.