On-demand, Spot, or Both: Dynamic Resource Allocation for Executing Batch Jobs in the Cloud

Cloud computing provides an attractive computing paradigm in which computational resources are rented on demand to users with zero capital and maintenance costs. Cloud providers offer different pricing options to meet the computing requirements of a wide variety of applications. An attractive option for batch computing is spot instances, which allow users to place bids for spare computing instances and rent them at an (often) substantially lower price than the fixed on-demand price. However, this raises three main challenges for users: how many instances to rent at any time, what type to rent (on-demand, spot, or both), and what bid value to use for spot instances. In particular, renting on-demand instances risks high costs, while renting spot instances risks job interruption and delayed completion when the spot market price exceeds the bid. This paper introduces an online learning algorithm for resource allocation to address this fundamental tradeoff between computation cost and performance. Our algorithm dynamically adapts resource allocation by learning from its performance on prior job executions while incorporating the history of spot prices and workload characteristics. We provide theoretical bounds on its performance and prove that the average regret of our approach (compared to the best policy in hindsight) vanishes to zero over time. Evaluation on traces from a large datacenter cluster shows that our algorithm outperforms greedy allocation heuristics and quickly converges to a small set of best-performing policies.
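The abstract does not spell out the update rule, so the following is only a rough illustration of the kind of online learning it describes: a standard multiplicative-weights (Hedge) update over a fixed set of candidate allocation policies. The three policies, their per-job costs, the learning rate `eta`, and the function names are all hypothetical and are not taken from the paper; costs are assumed to be normalized to [0, 1].

```python
import math

def hedge_update(weights, costs, eta):
    """One multiplicative-weights step: costlier policies lose weight."""
    updated = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
    total = sum(updated)
    return [w / total for w in updated]

def run_hedge(cost_rounds, eta=0.5):
    """Run Hedge over a sequence of per-policy cost vectors.

    Returns the final weights and the average regret against the
    best single policy in hindsight.
    """
    n = len(cost_rounds[0])
    weights = [1.0 / n] * n          # start with a uniform distribution
    expected_cost = 0.0
    for costs in cost_rounds:
        # Expected cost of sampling a policy from the current weights.
        expected_cost += sum(w * c for w, c in zip(weights, costs))
        weights = hedge_update(weights, costs, eta)
    # Cumulative cost of the best fixed policy, known only in hindsight.
    best_fixed = min(sum(r[i] for r in cost_rounds) for i in range(n))
    avg_regret = (expected_cost - best_fixed) / len(cost_rounds)
    return weights, avg_regret

# Hypothetical normalized costs per job for three made-up policies:
# index 0 = on-demand only, 1 = spot with a low bid, 2 = mixed.
rounds_short = [[0.9, 0.6, 0.3]] * 20
rounds_long = [[0.9, 0.6, 0.3]] * 200

weights_short, regret_short = run_hedge(rounds_short)
weights_long, regret_long = run_hedge(rounds_long)
print(weights_long, regret_short, regret_long)
```

In this toy setting the costs never change, so the weights concentrate on the cheapest policy and the average regret shrinks as more jobs are observed. Real spot prices and workloads vary over time, which is the case the paper's analysis is meant to cover.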
