Enhancing Productivity and Performance Portability of General-Purpose Parallel Programming

This work focuses on compiler and run-time techniques for improving the productivity and the performance portability of general-purpose parallel programming. More specifically, we focus on shared-memory task-parallel languages, where the programmer explicitly exposes parallelism in the form of short tasks that may outnumber the cores by orders of magnitude. The compiler, the run-time, and the platform (henceforth the system) are responsible for harnessing this unpredictable amount of parallelism, which can vary from none to excessive, towards efficient execution. The challenge arises from the aspiration to support fine-grained irregular computations and nested parallelism. This work is even more ambitious in also aspiring to lay the foundations for efficiently supporting declarative code, where the programmer exposes all available parallelism using high-level language constructs such as parallel loops, reducers, or futures. The appeal of declarative code for general-purpose programming is twofold: it is often easier for the programmer, who does not have to worry about the granularity of the exposed parallelism, and it achieves better performance portability by avoiding overfitting to the small range of platforms and inputs for which the programmer coarsens. Furthermore, PRAM algorithms, an important class of parallel algorithms, naturally lend themselves to declarative programming, so supporting it is a necessary condition for capitalizing on the wealth of PRAM theory. Unfortunately, declarative codes often expose such an overwhelming number of fine-grained tasks that existing systems fail to deliver performance.

Our contributions fall into three parts. First, we tackle coarsening, which declarative code leaves to the system. We identify two goals of coarsening and advocate tackling them separately, using static compiler transformations for one and dynamic run-time approaches for the other. Additionally, we present evidence that the current practice of burdening the programmer with coarsening leads either to codes with poor performance portability or to a significantly increased programming effort, a "show-stopper" for general-purpose programming. To compare the performance portability of the competing approaches, we define an experimental framework and two metrics, and we demonstrate that our approaches are preferable. We close the chapter on coarsening by presenting compiler transformations that automatically coarsen some types of very fine-grained codes (a sketch of this transformation appears after this summary).

Second, we propose Lazy Scheduling, an innovative run-time scheduling technique that infers the platform load at run-time from information the scheduler already maintains (also sketched below). Based on the inferred load, Lazy Scheduling adapts the amount of available parallelism it exposes for parallel execution and thus saves parallelism overheads that existing approaches pay. We implement Lazy Scheduling and present experimental results on four different platforms. The results show that Lazy Scheduling is vastly superior for declarative codes and competitive, if not better, for coarsened codes. Moreover, Lazy Scheduling is also superior in terms of performance portability, supporting our thesis that it is possible to achieve reasonable efficiency and performance portability with declarative codes.
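To make the coarsening discussion concrete, here is a minimal C++ sketch of the chunking transformation a static coarsener can apply to a flat, declarative parallel loop. The `parallel_apply` helper and its chunk-per-worker policy are illustrative assumptions of ours, not code from the thesis; they stand in for whatever parallel-loop construct the source language provides.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Coarsened execution of a flat parallel loop: the iteration space is
// split into one chunk per worker, and each chunk's iterations run
// serially inside a single task. A static coarsening transformation can
// derive this form mechanically from the declarative per-iteration loop,
// which would otherwise pay task-creation overhead on every iteration.
template <typename Body>
void parallel_apply(std::size_t n, Body body) {
  std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
  std::size_t chunk = (n + workers - 1) / workers;   // ceil(n / workers)
  std::vector<std::thread> pool;
  for (std::size_t lo = 0; lo < n; lo += chunk) {
    std::size_t hi = std::min(n, lo + chunk);
    pool.emplace_back([=] {                           // one task per chunk
      for (std::size_t i = lo; i < hi; ++i) body(i);  // serial inside chunk
    });
  }
  for (auto& t : pool) t.join();
}

// Usage: parallel_apply(n, [&](std::size_t i) { c[i] = a[i] + b[i]; });
```

Note that the chunk size is derived from the worker count and the trip count; this is precisely the platform- and input-specific tuning that hand-coarsened code bakes in, and that the system should perform instead.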
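The load-inference idea behind Lazy Scheduling can likewise be sketched in a few lines. The fragment below is a single-worker view in the style of our earlier Lazy Binary Splitting scheduler: the worker consults its own work-stealing deque, which it maintains anyway, as a cheap proxy for system load. The `Range` and `run_chunk` names and the plain `std::deque` are illustrative; a real implementation uses a concurrent deque and amortizes the emptiness check over several iterations.

```cpp
#include <cstddef>
#include <deque>
#include <functional>

// Half-open iteration range [lo, hi) of a parallel loop.
struct Range { std::size_t lo, hi; };

// One worker executing its share of a parallel loop under lazy splitting.
// An empty deque suggests thieves may be starving (low load), so the
// worker splits off half of its remaining range for others to steal;
// a non-empty deque suggests enough parallelism is already exposed,
// so it keeps iterating serially and pays no scheduling overhead.
void run_chunk(Range r, std::deque<Range>& my_deque,
               const std::function<void(std::size_t)>& body) {
  while (r.lo < r.hi) {
    if (my_deque.empty() && r.hi - r.lo > 1) {
      std::size_t mid = r.lo + (r.hi - r.lo) / 2;
      my_deque.push_back({mid, r.hi});  // expose half as stealable work
      r.hi = mid;                       // keep working on the other half
    }
    body(r.lo++);                       // execute one iteration serially
  }
}
```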
Finally, we implement Lazy Scheduling on XMT, an experimental manycore platform developed at the University of Maryland and designed to support codes derived from PRAM algorithms. On XMT, we harness the existing hardware support for scheduling flat parallelism and compose it with Lazy Scheduling, which supports nested parallelism. In the resulting hybrid scheduler, the hardware and software work in synergy to overcome each other's weaknesses. We show the performance composability of the hardware and software schedulers, both in an abstract cost model and experimentally: the hybrid always performs better than the software scheduler alone. Furthermore, we validate the cost model by using it to predict whether a code is best executed sequentially, with outer parallelism only, or with nested parallelism, depending on the input, the available hardware parallelism, and the calling context of the parallel code.
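As an illustration of the kind of prediction such a cost model enables, the hedged sketch below compares the three execution modes for an m-by-n doubly nested loop on P processors. The cost expressions and the constants C (cost of one inner iteration), oF (per-task overhead under the flat hardware scheduler), and oN (per-task overhead under the nested software scheduler) are simplified stand-ins we introduce here, not the model developed in the dissertation.

```cpp
#include <algorithm>

enum class Mode { Sequential, OuterOnly, Nested };

// Pick the cheapest mode for an m-by-n nested loop on P processors.
Mode choose_mode(double m, double n, double P,
                 double C,    // cost of one inner iteration
                 double oF,   // flat (hardware-scheduled) task overhead
                 double oN) { // nested (software-scheduled) task overhead
  double seq = m * n * C;  // no scheduling overhead at all
  // Outer parallelism: m tasks of n serial iterations; idles cores if m < P.
  double outer = std::max(m / P, 1.0) * n * C + m * oF;
  // Nested parallelism: m*n fine tasks; full utilization, more overhead.
  double nested = std::max((m * n) / P, 1.0) * C + m * oF + m * n * oN;
  if (seq <= outer && seq <= nested) return Mode::Sequential;
  return (outer <= nested) ? Mode::OuterOnly : Mode::Nested;
}
```

Even this toy model exhibits the dependence the text describes: a tiny loop nest runs best sequentially, an outer trip count of at least P makes outer-only parallelism sufficient, and a shallow outer loop over a large inner one justifies paying the nested overhead to keep all processors busy.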
