Reducing Load Imbalance of Virtual Clusters via Reconfiguration and Adaptive Job Scheduling

Extremely heterogeneous software stacks have encouraged the use of system virtualization for executing composite high performance computing (HPC) applications, with the aim of fully utilizing extreme-scale (exascale) HPC systems. Some parts of a composite application, called loosely-coupled components, consist of sets of loosely-coupled CPU-intensive jobs. These jobs run on a set of virtual machines (VMs), which are in turn distributed across physical machines. Co-location of VMs on physical machines is the main source of interference, which causes uncertainty in job completion times. Motivated by this challenge, our first goal is to introduce an adaptive job scheduling method for the VMs of loosely-coupled components that bounds the negative impact of interference. In addition, because virtualization abstracts away the hardware, job schedulers are unaware of the status of the underlying physical machines. Our second goal is therefore a scheme that dynamically reconfigures the job scheduler's parameters to inform the scheduler of the true status of the physical machines. This paper presents a combination of ASSIGN-ROUTE online job scheduling and a reconfiguration technique that allows a given loosely-coupled component to balance its resource usage load and thus improve the scaled execution of its loosely-coupled jobs. We prove that reconfiguration compensates for the scheduler's unawareness of the virtualized environment, so that the combined technique balances the load comparably to the optimal load balancing for online deterministic unrelated parallel machine makespan minimization scheduling. We also show that our experimental results support the theoretical results, especially in the case of scaled execution.
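
To make the scheduling setting concrete, the following is a minimal sketch of the exponential-cost greedy rule commonly used for online load balancing on unrelated machines, the family of algorithms to which ASSIGN-ROUTE belongs. It is an illustration under stated assumptions, not the method of this paper: the names `assign_job` and `schedule`, the base `a`, and the load estimate `lam` (a guess at the optimal makespan, assumed to be given) are hypothetical, and the reconfiguration step is not modeled.

```python
from typing import List, Sequence, Tuple


def assign_job(loads: List[float], costs: Sequence[float],
               lam: float, a: float = 2.0) -> int:
    """Place one job on the machine whose exponential cost grows least.

    loads[i] is the current load of machine i; costs[i] is the job's
    processing time on machine i (unrelated machines, so it may differ
    per machine). lam is an assumed estimate of the optimal makespan.
    """
    def marginal(i: int) -> float:
        # Increase of a**(load / lam) if the job were placed on machine i.
        return a ** ((loads[i] + costs[i]) / lam) - a ** (loads[i] / lam)

    best = min(range(len(loads)), key=marginal)
    loads[best] += costs[best]  # the assignment is irrevocable (online)
    return best


def schedule(jobs: Sequence[Sequence[float]], n_machines: int,
             lam: float, a: float = 2.0) -> Tuple[List[int], List[float]]:
    """Assign jobs one by one in arrival order; return placement and loads."""
    loads = [0.0] * n_machines
    placement = [assign_job(loads, costs, lam, a) for costs in jobs]
    return placement, loads


if __name__ == "__main__":
    # Three jobs, two machines; each row gives the job's cost per machine.
    jobs = [[1, 4], [2, 1], [3, 2]]
    placement, loads = schedule(jobs, n_machines=2, lam=4.0)
    print(placement, max(loads))
```

In this sketch the exponential penalty makes heavily loaded machines rapidly less attractive, which is what keeps the maximum load bounded relative to `lam`. A full algorithm of this kind would also need a way to choose and revise `lam` online, which is omitted here.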
