Monetary Cost-Aware Checkpointing and Migration on Amazon Cloud Spot Instances

Recently introduced spot instances in the Amazon Elastic Compute Cloud (EC2) offer low resource costs in exchange for reduced reliability: these instances can be revoked abruptly due to price and demand fluctuations. Mechanisms and tools that address the cost-reliability tradeoffs under this scheme are of great value to users seeking to reduce their costs while maintaining high reliability. We study how two mechanisms, checkpointing and migration, can be used to minimize the cost and volatility of resource provisioning. Based on the real price history of EC2 spot instances, we compare several adaptive checkpointing schemes in terms of monetary cost and improvement in job completion time. We also evaluate schemes that apply predictive methods to spot prices. Furthermore, we study how work migration can improve task completion in the midst of failures while maintaining low monetary costs. Trace-based simulations show that our schemes can significantly reduce both the monetary cost and the task completion time of computation on spot instances.
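The interplay between revocation and checkpointing described above can be illustrated with a toy trace-driven simulation. The sketch below is not one of the schemes evaluated in the paper: the function name `simulate`, the hourly price list, the bid, and the fixed checkpoint interval are all hypothetical, and it assumes that checkpoints are free and instantaneous, that each running hour is billed at the current spot price, and that an out-of-bid hour discards all unsaved work.

```python
def simulate(prices, bid, work_hours, checkpoint_every):
    """Toy model of one job on one spot instance (illustrative assumptions only).

    prices           -- hypothetical hourly spot prices, one entry per hour
    bid              -- user's bid; the instance runs only while price <= bid
    work_hours       -- hours of computation the job needs
    checkpoint_every -- save progress after this many unsaved hours; 0 = never

    Returns (completion_time_in_hours, total_cost), or (None, total_cost)
    if the trace ends before the job finishes.
    """
    saved = 0      # hours of work safely checkpointed
    progress = 0   # hours of work done so far, including unsaved work
    cost = 0.0
    for hour, price in enumerate(prices):
        if price > bid:
            # Out-of-bid hour: the instance is revoked and unsaved work is lost.
            progress = saved
            continue
        cost += price      # assumed billing: pay the spot price per running hour
        progress += 1
        if checkpoint_every and progress - saved >= checkpoint_every:
            saved = progress   # assumed zero-overhead checkpoint
        if progress >= work_hours:
            return hour + 1, cost
    return None, cost


# Hypothetical six-hour trace with a single price spike in the third hour.
trace = [0.03, 0.03, 0.10, 0.03, 0.03, 0.03]
with_ckpt = simulate(trace, bid=0.05, work_hours=3, checkpoint_every=1)
without_ckpt = simulate(trace, bid=0.05, work_hours=3, checkpoint_every=0)
print("hourly checkpointing:", with_ckpt)
print("no checkpointing:    ", without_ckpt)
```

On this made-up trace the checkpointed run finishes after 4 hours, while the run without checkpoints restarts from scratch after the spike and needs all 6 hours, paying for two extra instance hours. A realistic model would also have to charge for checkpoint and restart time, which is what makes the choice of checkpointing scheme non-trivial.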
