Portfolio-driven Resource Management for Transient Cloud Servers

Cloud providers have begun to offer their surplus capacity in the form of low-cost transient servers, which can be revoked unilaterally at any time. While the low cost of transient servers makes them attractive for a wide range of applications, such as data processing and scientific computing, failures due to server revocation can severely degrade application performance. Since different transient server types offer different cost and availability tradeoffs, we present the notion of server portfolios, which is based on financial portfolio modeling. Server portfolios enable the construction of an "optimal" mix of servers that matches an application's sensitivity to cost and revocation risk. We implement model-driven portfolios in a system called ExoSphere, and show how diverse applications can use portfolios and application-specific policies to gracefully handle transient servers. We show that ExoSphere enables widely used parallel applications such as Spark, MPI, and BOINC to be made transiency-aware with modest effort. Our experiments show that, by allowing applications to use suitable transiency-aware policies, ExoSphere achieves 80% cost savings compared to on-demand servers and greatly reduces revocation risk compared to existing approaches.
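
To make the portfolio idea concrete, the following is a minimal sketch that assumes a Markowitz-style mean-variance objective: the weighted cost of the chosen server mix plus a risk-aversion factor times the variance of that mix. The abstract does not spell out ExoSphere's actual formulation, so the objective, the function names, the grid search, and all numbers below are illustrative assumptions rather than the system's method.

```python
from itertools import product


def portfolio_objective(weights, costs, cov, risk_aversion):
    """Expected cost of the mix plus a penalty on its variance (assumed objective)."""
    n = len(weights)
    expected_cost = sum(w * c for w, c in zip(weights, costs))
    variance = sum(
        weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    return expected_cost + risk_aversion * variance


def best_portfolio(costs, cov, risk_aversion, steps=20):
    """Grid-search the weight simplex for the mix with the lowest objective."""
    n = len(costs)
    best_score, best_weights = None, None
    # Enumerate the first n-1 weights in units of 1/steps; the last one
    # takes whatever remains so that the weights always sum to 1.
    for combo in product(range(steps + 1), repeat=n - 1):
        if sum(combo) > steps:
            continue
        weights = [k / steps for k in combo] + [(steps - sum(combo)) / steps]
        score = portfolio_objective(weights, costs, cov, risk_aversion)
        if best_score is None or score < best_score:
            best_score, best_weights = score, weights
    return best_weights


# Hypothetical inputs: hourly cost per server type and a diagonal covariance
# matrix standing in for revocation risk (cheaper types are riskier here).
costs = [0.10, 0.12, 0.15]
cov = [
    [0.09, 0.00, 0.00],
    [0.00, 0.04, 0.00],
    [0.00, 0.00, 0.01],
]

cost_only = best_portfolio(costs, cov, risk_aversion=0.0)
risk_averse = best_portfolio(costs, cov, risk_aversion=10.0)
```

With no risk aversion the search puts all weight on the cheapest type; with a large risk-aversion factor it spreads weight across several types, which is the cost-versus-revocation-risk tradeoff that the portfolio abstraction is meant to expose. A real implementation would more likely use a convex solver than a grid search.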