Scheduling of online compute-intensive synchronized jobs on high performance virtual clusters

Abstract This paper presents a high performance technique for virtualization-unaware scheduling of compute-intensive synchronized (i.e., tightly-coupled) jobs in virtualized high performance computing systems. Online tightly-coupled jobs are assigned and reassigned to clustered virtual machines based on synchronization costs. Virtual machines are in turn assigned and reassigned to clustered physical machines based on CPU load. Our analytical study shows that it is possible to minimize the performance and scalability degradation of high performance computing systems and applications, such as ExaScale and PetaScale ones, for which virtualization technology is recommended as a way to achieve a higher degree of performability, namely higher utilization, energy efficiency, portability, flexibility and configurability.
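The two-level placement described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses a plain greedy heuristic as a stand-in, and every name in it (`assign_job`, `assign_vm`, `sync_cost`, `capacity`, the integer load units) is a hypothetical choice made for illustration. Jobs go to the virtual machine that adds the least synchronization cost with peers already placed elsewhere, and virtual machines go to the physical machine with the lowest current CPU load.

```python
def assign_job(job, vms, placement, sync_cost, capacity):
    """Greedily place `job` on the VM that adds the least synchronization cost.

    Assumption: a pair of jobs pays sync_cost[(a, b)] only when the two jobs
    sit on different VMs; co-located jobs synchronize for free.
    """
    best_vm, best_cost = None, None
    for vm in vms:
        # Skip VMs that already hold `capacity` jobs.
        if sum(1 for v in placement.values() if v == vm) >= capacity:
            continue
        # Sum the cost of every pair involving `job` whose peer is on another VM.
        cost = sum(
            c for (a, b), c in sync_cost.items()
            if (a == job and placement.get(b) not in (None, vm))
            or (b == job and placement.get(a) not in (None, vm))
        )
        if best_cost is None or cost < best_cost:
            best_vm, best_cost = vm, cost
    if best_vm is None:
        raise RuntimeError("no VM has free capacity")
    placement[job] = best_vm
    return best_vm


def assign_vm(vm_load, pm_load):
    """Place a VM on the physical machine with the lowest current CPU load."""
    pm = min(pm_load, key=pm_load.get)
    pm_load[pm] += vm_load
    return pm


# Hypothetical workload: job "a" synchronizes heavily with "b", "b" lightly with "c".
sync_cost = {("a", "b"): 5, ("b", "c"): 1}
placement = {}
for job in ["a", "b", "c"]:
    assign_job(job, ["vm0", "vm1"], placement, sync_cost, capacity=2)

pm_load = {"pm0": 50, "pm1": 20}  # CPU load in percent (assumed unit)
chosen_pm = assign_vm(30, pm_load)
```

In this sketch the heavily synchronizing pair ends up on the same virtual machine, and the new virtual machine lands on the less loaded physical machine. Reassignment (migration) and its cost are left out entirely.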
