Hard real-time scheduling for parallel run-time systems

High performance parallel computing demands careful synchronization, timing, and performance isolation and control, as well as the avoidance of OS noise and other types of noise. Soft real-time systems have already shown considerable promise toward these ends, particularly for distributed memory machines. As processor core counts grow rapidly, a natural question is whether similar promise extends to the node. To address this question, we present the design, implementation, and performance evaluation of a hard real-time scheduler built specifically for high performance parallel computing on shared memory nodes with x64 processors, such as the Xeon Phi. Our scheduler is embedded in a kernel framework that is already specialized for high performance parallel run-times and applications, and that meets the basic requirements of a real-time OS (RTOS). The scheduler adds hard real-time threads both in their classic, individual form and in a group form, in which a group of parallel threads executes in near lock-step using only scalable, per-hardware-thread scheduling. On a current generation Intel Xeon Phi, the scheduler can handle timing constraints down to a resolution of ∼13,000 cycles (∼10 μs), with synchronization to within ∼4,000 cycles (∼3 μs) among 255 parallel threads. The scheduler isolates a parallel group and can throttle its resources, with application performance changing commensurately. We also show that in some cases such fine-grain control over time allows us to eliminate barrier synchronization, leading to performance gains, particularly for fine-grain BSP workloads.
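
The cycle counts and microsecond figures quoted above can be related with a short calculation. The sketch below is illustrative only: the 1.3 GHz clock rate is an assumption inferred from the ratio of the quoted numbers (13,000 cycles to about 10 μs) and is not stated in the abstract, and the names `ASSUMED_CLOCK_HZ` and `cycles_to_us` are hypothetical.

```python
# Convert cycle counts to microseconds under an assumed clock rate.
# ASSUMPTION: 1.3 GHz is inferred from the abstract's own figures
# (~13,000 cycles ~ 10 us); the abstract does not state the clock rate.
ASSUMED_CLOCK_HZ = 1.3e9


def cycles_to_us(cycles: float, clock_hz: float = ASSUMED_CLOCK_HZ) -> float:
    """Return the duration of `cycles` clock cycles in microseconds."""
    return cycles / clock_hz * 1e6


# Figures quoted in the abstract.
constraint_us = cycles_to_us(13_000)  # timing-constraint resolution
sync_us = cycles_to_us(4_000)         # group synchronization bound

print(f"constraint resolution: about {constraint_us:.1f} us")
print(f"group synchronization: about {sync_us:.1f} us")
```

Under that assumed clock rate, both results are consistent with the approximate values given in the abstract; a different clock rate would shift them proportionally.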
