Flexible resource allocation for reliable virtual cluster computing systems

Virtualization and cloud computing technologies now make it possible to create scalable and reliable virtual high performance computing clusters. Integrating these technologies, however, is complicated by fundamental differences in how these systems allocate resources to computational tasks. Cloud computing systems either allocate available resources immediately or deny the request. In contrast, parallel computing systems route all requests through a queue for future resource allocation. This divergence in allocation policies hinders efforts to implement efficient, responsive, and reliable virtual clusters. In this paper, we present a continuum of four scheduling policies, along with an analytical resource prediction model for each policy, to estimate the level of resources needed to operate an efficient, responsive, and reliable virtual cluster system. We show that it is possible to estimate the size of the virtual cluster system needed to provide a predictable grade of service for a realistic high performance computing workload, and to estimate the queue wait time for a partial or full resource allocation. Moreover, we show that it is possible to provide a reliable virtual cluster system using a limited pool of spare resources. The models and results we present are useful for cloud computing providers seeking to operate efficient and cost-effective virtual cluster systems.
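The contrast between allocate-or-deny and queue-everything allocation can be illustrated with the classical Erlang B (loss) and Erlang C (delay) formulas. The sketch below is an assumption made for illustration only: the abstract does not state which analytical models the paper uses, and the function names `erlang_b`, `erlang_c`, and `min_servers` are hypothetical. It assumes Poisson arrivals, a single resource class, and an offered load expressed in erlangs (arrival rate times mean service time).

```python
def erlang_b(servers: int, offered_load: float) -> float:
    """Probability that a request is denied in an allocate-or-deny system.

    Uses the standard numerically stable Erlang B recursion.
    """
    blocking = 1.0  # with zero servers every request is blocked
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that a request must wait in a queue-everything system."""
    if offered_load >= servers:
        return 1.0  # unstable queue: every request waits
    blocking = erlang_b(servers, offered_load)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def min_servers(offered_load: float, target_blocking: float) -> int:
    """Smallest pool size whose Erlang B blocking meets the target (hypothetical helper)."""
    servers = 1
    while erlang_b(servers, offered_load) > target_blocking:
        servers += 1
    return servers
```

For example, with an offered load of 1.0 erlang on 2 nodes, these functions give a denial probability of 0.2 under the cloud-style policy and a wait probability of 1/3 under the queue-style policy. Sizing a pool for a target grade of service then reduces to a search over the node count, as `min_servers` does.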
