Modeling The Temporally Constrained Preemptions of Transient Cloud VMs

Transient cloud servers, such as Amazon Spot instances, Google Preemptible VMs, and Azure Low-priority batch VMs, can reduce cloud computing costs by as much as 10x, but the cloud provider can preempt them unilaterally. Understanding preemption characteristics (such as frequency) is a key first step in minimizing the effect of preemptions on application performance, availability, and cost. However, little is understood about temporally constrained preemptions, in which a preemption must occur within a given time window. We study temporally constrained preemptions through a large-scale empirical study of Google's Preemptible VMs, which have a maximum lifetime of 24 hours. Based on this study, we develop a new preemption probability model and new model-driven resource management policies, and implement them in a batch computing service for scientific computing workloads. Our statistical and experimental analysis indicates that temporally constrained preemptions are not uniformly distributed: they are time-dependent and follow a bathtub shape. We find that existing memoryless models and policies are not suitable for temporally constrained preemptions. We develop a new probability model for bathtub preemptions and analyze it through the lens of reliability theory. To highlight the effectiveness of our model, we develop optimized policies for job scheduling and checkpointing. Compared to existing techniques, our model-based policies can reduce the probability of job failure by more than 2x. The batch computing service that implements these policies for scientific computing applications reduces cost by 5x compared to conventional cloud deployments and keeps performance overheads under 3%.
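To make the contrast between memoryless and bathtub-shaped preemptions concrete, the following is a minimal sketch and not the model developed in this work. The functional form of `bathtub_hazard` (a decaying early term, a flat middle, and a rising term near the deadline) and every parameter value are illustrative assumptions. Only the 24-hour maximum lifetime comes from the text above. The sketch computes the probability that a job is preempted before it finishes, given the age of the VM when the job starts.

```python
import math

LIFETIME_H = 24.0  # maximum VM lifetime in hours, as stated in the abstract


def constant_hazard(t, rate=0.05):
    """Memoryless baseline: the preemption rate does not depend on VM age."""
    return rate


def bathtub_hazard(t, early=0.4, tau_early=1.0, base=0.01,
                   late=0.5, tau_late=0.8):
    """Hypothetical bathtub hazard (per hour) at VM age t.

    All parameters are made-up illustrative values, not fitted to data.
    """
    early_term = early * math.exp(-t / tau_early)
    late_term = late * math.exp(-(LIFETIME_H - t) / tau_late)
    return early_term + base + late_term


def job_failure_probability(hazard, start_age, job_hours, steps=2000):
    """P(preemption during the job) = 1 - exp(-integral of the hazard)."""
    if start_age + job_hours > LIFETIME_H:
        return 1.0  # the job would run past the hard deadline
    h = job_hours / steps
    # Trapezoidal rule over [start_age, start_age + job_hours].
    total = 0.5 * (hazard(start_age) + hazard(start_age + job_hours))
    for i in range(1, steps):
        total += hazard(start_age + i * h)
    return 1.0 - math.exp(-total * h)


if __name__ == "__main__":
    for age in (0.0, 8.0, 19.0):
        p_flat = job_failure_probability(constant_hazard, age, 4.0)
        p_tub = job_failure_probability(bathtub_hazard, age, 4.0)
        print(f"start age {age:4.1f} h: memoryless {p_flat:.3f}, bathtub {p_tub:.3f}")
```

Under the memoryless baseline the failure probability of a 4-hour job is the same at every start age, so VM age carries no scheduling information. Under the assumed bathtub hazard, the same job is least likely to fail when started in the middle of the VM's lifetime, which is the kind of time dependence a model-driven scheduling or checkpointing policy can exploit.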
