Scheduling directives: Accelerating shared-memory many-core processor execution

We consider many-core processors with a task-graph-oriented programming model, in which scheduling constraints among tasks are decided offline and then enforced by the runtime system using dedicated hardware. In this setting, exposing and exploiting fine-grain data and control parallelism is increasingly important. Success therefore depends on two things: high expressive power for stating such constraints (directives), and the ability to implement them in fast, simple hardware. In this paper, we focus on the relationships among different duplicable (multi-instance) tasks, which are used to express and exploit data parallelism. We extend the conventional Start-After-Complete (precedence) constraint so that it can also be applied between replicas of different such tasks rather than only between entire tasks, thereby increasing the exposable parallelism. We also propose the parameterized Start-After-Start constraint, which can be used to control the degree of "lockstep" among multiple such tasks, e.g., to improve cache performance when the tasks work on the same data. We briefly describe several additional directives as well. Finally, we show that the directives can be supported efficiently in hardware. The discussion uses Hypercore, a very efficient CREW PRAM-like shared-cache architecture; it is a challenging target because its dispatching for basic constraints is extremely fast. The new directives nonetheless have broader applicability. The demonstrated possibility of a simple implementation, together with indications of benefit, motivates further exploration of these directives and their implementation in hardware, as well as their support by programming tools.
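To make the difference between the directives concrete, the following is a minimal software sketch and not the hardware mechanism described in the paper. It assumes semantics the abstract does not spell out: replica i of a successor task is matched with replica i of its predecessor, and the Start-After-Start parameter (here called `lag`) is an index offset. The names `DuplicableTask`, `sac_task_level`, `sac_replica_level`, `sas`, and `lag` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DuplicableTask:
    """A multi-instance task that records which replicas started/completed."""
    name: str
    n_instances: int
    started: set = field(default_factory=set)
    completed: set = field(default_factory=set)


def sac_task_level(pred: DuplicableTask, i: int) -> bool:
    """Conventional Start-After-Complete: all replicas of pred are complete."""
    return len(pred.completed) == pred.n_instances


def sac_replica_level(pred: DuplicableTask, i: int) -> bool:
    """Assumed replica-level rule: only replica i of pred must be complete."""
    return i in pred.completed


def sas(pred: DuplicableTask, i: int, lag: int) -> bool:
    """Assumed parameterized Start-After-Start: replica i may start once
    replica i - lag of pred has started; replicas with i < lag are free."""
    return i < lag or (i - lag) in pred.started


# Predecessor A has 4 replicas: 0 and 1 have started, only 0 has completed.
a = DuplicableTask("A", 4)
a.started.update({0, 1})
a.completed.add(0)

replicas = range(4)
ready_task_level = [i for i in replicas if sac_task_level(a, i)]
ready_replica_level = [i for i in replicas if sac_replica_level(a, i)]
ready_sas = [i for i in replicas if sas(a, i, lag=1)]

print(ready_task_level)     # []
print(ready_replica_level)  # [0]
print(ready_sas)            # [0, 1, 2]
```

Under these assumptions, the task-level rule releases no successor replica until all of A has finished, the replica-level rule releases replica 0 immediately, and Start-After-Start with `lag=1` lets the successor run at most one replica ahead of what A has started, which is one way to bound how far the two tasks drift apart.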
