Dynamically Negotiating Capacity Between On-demand and Batch Clusters

In the era of rapid experimental expansion, data analysis needs are rapidly outpacing the capabilities of small institutional clusters, and institutions are looking to integrate HPC resources into their workflows. We propose one way of reconciling the on-demand needs of experimental analytics with batch-managed HPC resources: a system that dynamically moves nodes between an on-demand cluster configured with cloud technology (OpenStack) and a traditional HPC cluster managed by a batch scheduler (Torque). We evaluate this system experimentally, both on real-life traces representing two years of a specific institutional workload and on synthetic traces that capture generalized characteristics of potential batch and on-demand workloads. Our results for the real-life scenario show that our approach could reduce the current investment in on-demand infrastructure by 82% while improving the mean batch wait time by almost an order of magnitude (8x).
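
The core mechanism can be sketched simply: a node is drained from one resource manager and then enabled in the other. The fragment below is a minimal illustrative sketch of that idea, not the paper's actual implementation. It assumes Torque's `pbsnodes` and the OpenStack CLI are available on a management host with admin credentials loaded in the environment, that each node already runs a (possibly disabled) nova-compute service, and that the simple polling loop is an adequate stand-in for a real drain policy; node names and timing values are placeholders.

```python
"""Hypothetical sketch: move a node between a Torque batch pool and an
OpenStack on-demand pool by draining it from one scheduler and enabling
it in the other. Not the paper's implementation; commands shown are the
standard Torque and OpenStack CLIs, everything else is an assumption."""
import subprocess
import time


def run(cmd):
    """Run a shell command, raising on failure."""
    subprocess.run(cmd, check=True)


def move_to_on_demand(node):
    """Drain `node` from Torque, then hand it to OpenStack."""
    # Mark the node offline so Torque stops placing new batch jobs on it.
    run(["pbsnodes", "-o", node])
    # Wait for running batch jobs to drain. Checking for a "jobs =" line
    # in the pbsnodes output is a simplification; a real system would
    # bound the wait and/or preempt.
    while b"jobs = " in subprocess.run(
            ["pbsnodes", node], capture_output=True).stdout:
        time.sleep(30)
    # Enable the node's nova-compute service so OpenStack can schedule
    # on-demand instances onto it.
    run(["openstack", "compute", "service", "set",
         "--enable", node, "nova-compute"])


def move_to_batch(node):
    """Withdraw `node` from OpenStack and return it to the Torque pool."""
    # Stop OpenStack from placing new instances on the node.
    run(["openstack", "compute", "service", "set",
         "--disable", node, "nova-compute"])
    # (A real system would also migrate or wait out existing instances.)
    # Clear the offline flag so Torque can schedule batch jobs again.
    run(["pbsnodes", "-c", node])
```

In practice the interesting part, which the sketch omits, is the negotiation policy deciding when nodes move and in which direction; the mechanics above only show that the hand-off itself reduces to drain-and-enable operations on the two resource managers.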
