Hedge Your Bets: Optimizing Long-term Cloud Costs by Mixing VM Purchasing Options

Cloud platforms offer the same VMs under many purchasing options that specify different costs and time commitments, such as on-demand, reserved, sustained-use, scheduled reserve, transient, and spot block. In general, the stronger the commitment, i.e., the longer and less flexible it is, the lower the price. However, longer and less flexible time commitments can increase cloud costs for users if future workloads cannot utilize the VMs they committed to buying. Large cloud customers often find it challenging to choose the right mix of purchasing options to reduce their long-term costs while retaining the ability to adjust capacity up and down in response to workload variations. To address this problem, we design policies that optimize long-term cloud costs by selecting a mix of VM purchasing options based on short- and long-term expectations of workload utilization. We consider a batch trace spanning 4 years from a large shared cluster for a major state university system that includes 14k cores and 60 million job submissions, and evaluate how these jobs could be judiciously executed on cloud servers using our approach. Our results show that our policies incur a cost within 41% of an optimistic optimal offline approach, and 50% less than solely using on-demand VMs.
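The trade-off between commitment and price described above can be illustrated with a minimal sketch. It is not the paper's policy: it assumes only two purchasing options (on-demand and reserved), a demand trace that is fully known in advance, and hypothetical prices. The function names `break_even_utilization` and `cheapest_mix` are invented for this illustration.

```python
from typing import List, Tuple


def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the commitment period a reserved VM must be busy
    before it becomes cheaper than renting on-demand for the busy hours."""
    return reserved_hourly / on_demand_hourly


def cheapest_mix(demand: List[int], on_demand_hourly: float,
                 reserved_hourly: float) -> Tuple[int, float]:
    """Brute-force the number of reserved VMs that minimizes total cost.

    `demand` is a non-empty list of VMs needed in each hour (assumed known
    in advance). Reserved VMs are paid for every hour, used or not;
    demand above the reserved count is served by on-demand VMs.
    """
    best_count, best_cost = 0, float("inf")
    for reserved in range(max(demand) + 1):
        reserved_cost = reserved * reserved_hourly * len(demand)
        overflow_hours = sum(max(d - reserved, 0) for d in demand)
        total = reserved_cost + overflow_hours * on_demand_hourly
        if total < best_cost:
            best_count, best_cost = reserved, total
    return best_count, best_cost


if __name__ == "__main__":
    # Hypothetical prices: reserved costs 60% of the on-demand hourly rate.
    trace = [2, 2, 2, 10]  # steady base load with one burst
    print(break_even_utilization(1.0, 0.6))
    print(cheapest_mix(trace, on_demand_hourly=1.0, reserved_hourly=0.6))
```

In this toy trace, reserving enough VMs for the steady base load and serving the burst on-demand is cheaper than either extreme, which is the intuition behind mixing purchasing options. A real policy would have to act on forecasts rather than a known trace.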
