Building Semi-Elastic Virtual Clusters for Cost-Effective HPC Cloud Resource Provisioning

Recent studies have found cloud environments increasingly appealing for executing HPC applications, including tightly coupled parallel simulations. However, while public clouds offer elastic, on-demand resource provisioning and pay-as-you-go pricing, individual users who set up their own on-demand virtual clusters may not be able to take full advantage of common cost-saving opportunities, such as reserved instances. In this paper, we propose a Semi-Elastic Cluster (SEC) computing model that lets an organization reserve and dynamically resize a virtual cloud-based cluster. We present a set of integrated batch scheduling and resource scaling strategies uniquely enabled by SEC, as well as an online reserved instance provisioning algorithm based on job history. Our trace-driven simulation results show that this model reduces cost by 61.0 percent compared with individual users acquiring and managing cloud resources on their own, without lengthening the average job wait time. Moreover, to exploit the advantages of different public clouds, we extend SEC to a multi-cloud environment, where it achieves a lower cost than on any single cloud. We design and implement a prototype of the SEC model and evaluate it in terms of management overhead and average job wait time. Experimental results show that the management overhead is negligible relative to the job wait time.
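
As a rough illustration of the cost trade-off behind reserved instances, the following Python sketch compares serving an hourly demand trace with a fixed pool of reserved instances plus on-demand bursting. It is not the SEC provisioning algorithm, which the abstract does not specify. The function names, the pricing model (an upfront fee plus an hourly rate billed whether or not the instance is used), and all numbers are assumptions made for illustration only.

```python
from typing import Sequence


def hybrid_cost(demand: Sequence[int], reserved: int, upfront_fee: float,
                reserved_hourly: float, on_demand_hourly: float) -> float:
    """Cost of serving `demand` (instances needed per hour) with `reserved`
    reserved instances, bursting to on-demand instances for any excess.

    Assumed pricing: each reserved instance costs an upfront fee plus an
    hourly rate for every hour of the trace, whether it is used or idle.
    """
    hours = len(demand)
    reserved_cost = reserved * (upfront_fee + reserved_hourly * hours)
    # Demand above the reserved pool is served by on-demand instances.
    burst_hours = sum(max(d - reserved, 0) for d in demand)
    return reserved_cost + burst_hours * on_demand_hourly


def best_reservation(demand: Sequence[int], upfront_fee: float,
                     reserved_hourly: float, on_demand_hourly: float) -> int:
    """Exhaustively pick the reserved pool size that minimizes total cost."""
    candidates = range(max(demand, default=0) + 1)
    return min(candidates, key=lambda r: hybrid_cost(
        demand, r, upfront_fee, reserved_hourly, on_demand_hourly))


# Hypothetical aggregated trace: steady load of 4 instances, one burst to 10.
trace = [4, 4, 4, 4, 10, 4, 4, 4]
size = best_reservation(trace, upfront_fee=2.0, reserved_hourly=0.5,
                        on_demand_hourly=1.0)
print(size, hybrid_cost(trace, size, 2.0, 0.5, 1.0),
      hybrid_cost(trace, 0, 2.0, 0.5, 1.0))
```

With these made-up prices, reserving enough instances to cover the steady load and paying on-demand rates only for the burst is cheaper than either extreme. An organization-level cluster sees a steadier aggregate load than any single user, which is what makes such reservations pay off.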
