Runtime Systems for Extreme Scale Platforms

Abstract: Future extreme-scale systems are expected to contain both homogeneous and heterogeneous many-core processors, with O(10^3) cores per node and O(10^6) nodes overall. Effectively combining inter-node and intra-node parallelism is recognized as a major software challenge for such systems. Further, applications will have to operate within constrained energy budgets while tolerating frequent faults and failures. To help programmers manage these complexities and enhance programmability, much recent research has focused on designing state-of-the-art software runtime systems. Such runtime systems are expected to be a critical component of the software ecosystem for managing parallelism, locality, load balancing, energy, and resilience on extreme-scale systems. In this dissertation, we address three key challenges faced by a runtime system that uses a dynamic task-parallel framework for extreme-scale computing.
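To make the hybrid execution model concrete, the sketch below illustrates one common way inter-node and intra-node parallelism are combined today: MPI ranks across nodes, with dynamic OpenMP tasks inside each rank. This is a minimal illustrative example only, not the runtime system developed in this dissertation; the function process_chunk and the chunk counts are hypothetical stand-ins for application work.

/* hybrid_sketch.c: illustrative MPI + OpenMP-tasks hybrid (assumed build:
 * mpicc -fopenmp hybrid_sketch.c -o hybrid_sketch). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hypothetical per-chunk kernel standing in for real application work. */
static double process_chunk(int chunk) {
    double sum = 0.0;
    for (int i = 0; i < 1000; i++)
        sum += (double)chunk * i;
    return sum;
}

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Request threaded MPI so OpenMP threads and MPI can coexist;
     * all MPI calls here are made outside the parallel region. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int chunks_per_rank = 64;
    double local = 0.0;

    /* Intra-node parallelism: one thread spawns dynamic tasks over chunks;
     * the OpenMP runtime load-balances them across the node's cores. */
    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < chunks_per_rank; c++) {
        #pragma omp task firstprivate(c) shared(local)
        {
            double r = process_chunk(rank * chunks_per_rank + c);
            #pragma omp atomic
            local += r;
        }
    }
    /* Implicit barrier at the end of the parallel region waits for all tasks. */

    /* Inter-node parallelism: combine per-rank results across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}

In this pattern, load balancing within a node is delegated to the task scheduler, while communication across nodes remains explicit; the challenges addressed in this dissertation (parallelism management, locality, load balancing, energy, and resilience) arise at the interface between these two levels.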

[1]  Stephen L. Olivier,et al.  UTS: An Unbalanced Tree Search Benchmark , 2006, LCPC.

[2]  Arvind,et al.  M-Structures: Extending a Parallel, Non-strict, Functional Language with State , 1991, FPCA.

[3]  Robert W. Numrich,et al.  Co-array Fortran for parallel programming , 1998, FORF.

[4]  Volker Strumpen,et al.  Cache oblivious stencil computations , 2005, ICS '05.

[5]  Katherine A. Yelick,et al.  Multi-threading and one-sided communication in parallel LU factorization , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[6]  Carl Hewitt,et al.  The incremental garbage collection of processes , 1977, Artificial Intelligence and Programming Languages.

[7]  Katherine A. Yelick,et al.  Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..

[8]  Bowen Alpern,et al.  Modeling parallel computers as memory hierarchies , 1993, Proceedings of Workshop on Programming Models for Massively Parallel Computers.

[9]  William N. Scherer,et al.  A new vision for coarray Fortran , 2009, PGAS '09.

[10]  Vivek Sarkar Synchronization using counting semaphores , 1988, ICS '88.

[11]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Boris D. Lubachevsky Synchronization barrier and related tools for shared memory parallel programming , 2005, International Journal of Parallel Programming.

[13]  Eduard Ayguadé,et al.  Overlapping communication and computation by using a hybrid MPI/SMPSs approach , 2010, ICS '10.

[14]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[15]  Dhabaleswar K. Panda,et al.  MVAPICH-Aptus: Scalable high-performance multi-transport MPI over InfiniBand , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[16]  Andrea C. Arpaci-Dusseau,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[17]  Jason Duell,et al.  Productivity and performance using partitioned global address space languages , 2007, PASCO '07.

[18]  Rajeev Thakur,et al.  Test suite for evaluating performance of multithreaded MPI communication , 2009, Parallel Comput..

[19]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[20]  Per Brinch Hansen The Origin of Concurrent Programming , 2002, Springer New York.

[21]  Vivek Sarkar,et al.  Habanero-Java: the new adventures of old X10 , 2011, PPPJ.

[22]  Katherine Yelick,et al.  Introduction to UPC and Language Specification , 2000 .

[23]  Stephen L. Olivier,et al.  Dynamic Load Balancing of Unbalanced Computations Using Message Passing , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[24]  Barbara Chapman,et al.  Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation) , 2007 .

[25]  Vivek Sarkar,et al.  Communication Optimizations for Distributed-Memory X10 Programs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[26]  Sayantan Sur,et al.  Unifying UPC and MPI runtimes: experience with MVAPICH , 2010, PGAS '10.

[27]  William J. Dally,et al.  Sequoia: Programming the Memory Hierarchy , 2006, International Conference on Software Composition.

[28]  David A. Padua,et al.  Programming for parallelism and locality with hierarchically tiled arrays , 2006, PPoPP '06.

[29]  Paul N. Hilfinger,et al.  Better Tiling and Array Contraction for Compiling Scientific Programs , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[30]  David Chase,et al.  Dynamic circular work-stealing deque , 2005, SPAA '05.

[31]  Alexandros Stamatakis,et al.  Hybrid MPI/Pthreads parallelization of the RAxML phylogenetics code , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[32]  Yves Robert,et al.  Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC , 2013, Euro-Par Workshops.

[33]  Katherine A. Yelick,et al.  Hybrid PGAS runtime support for multicore nodes , 2010, PGAS '10.

[34]  Rolf Riesen,et al.  Portals 3.0: protocol building blocks for low overhead communication , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[35]  Michael L. Scott,et al.  Fast, contention-free combining tree barriers for shared-memory multiprocessors , 1994, International Journal of Parallel Programming.

[36]  Graph Topology MPI at Exascale , 2010 .

[37]  Vivek Sarkar,et al.  Comparing the usability of library vs. language approaches to task parallelism , 2010, PLATEAU '10.

[38]  Rohit Chandra,et al.  Parallel programming in openMP , 2000 .

[39]  Haoqiang Jin,et al.  Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks , 2004, IPDPS.

[40]  Franck Cappello,et al.  MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[41]  Wu-chun Feng,et al.  On the efficacy of GPU-integrated MPI for scientific applications , 2013, HPDC '13.

[42]  D. Panda,et al.  Extending OpenSHMEM for GPU Computing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[43]  Lei Huang,et al.  Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation , 2010, LCPC.

[44]  Stephen A. Jarvis,et al.  Performance analysis of a hybrid MPI/CUDA implementation of the NASLU benchmark , 2011, PERV.

[45]  Georg Hager,et al.  Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[46]  Katherine A. Yelick,et al.  Optimizing bandwidth limited problems using one-sided communication and overlap , 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[47]  Katherine Yelick,et al.  Titanium Language Reference Manual , 2001 .

[48]  Vivek Sarkar,et al.  Phasers: a unified deadlock-free construct for collective and point-to-point synchronization , 2008, ICS '08.

[49]  Vivek Sarkar,et al.  Phaser accumulators: A new reduction construct for dynamic parallelism , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[50]  Laxmikant V. Kalé,et al.  Adaptive MPI , 2003, LCPC.

[51]  Stephen A. Edwards,et al.  Compile-Time Analysis and Specialization of Clocks in Concurrent Programs , 2009, CC.

[52]  James Reinders,et al.  Intel® threading building blocks , 2008 .

[53]  Vivek Sarkar,et al.  Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures , 2011, Euro-Par.

[54]  Guillaume Mercier,et al.  Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[55]  John C. Reynolds,et al.  The discoveries of continuations , 1993, LISP Symb. Comput..

[56]  Dhabaleswar K. Panda,et al.  Scalable Earthquake Simulation on Petascale Supercomputers , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[57]  Vivek Sarkar,et al.  Integrating Asynchronous Task Parallelism with MPI , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[58]  Dong Li,et al.  Hybrid MPI/OpenMP power-aware computing , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[59]  Yvon Jégou,et al.  Task Migration and Fine Grain Parallelism on Distributed Memory Architectures , 1997, PaCT.

[60]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[61]  Alejandro Duran,et al.  Productive Cluster Programming with OmpSs , 2011, Euro-Par.

[62]  Franck Cappello,et al.  Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[63]  Keshav Pingali,et al.  I-structures: Data structures for parallel computing , 1986, Graph Reduction.

[64]  Georg Hager,et al.  Hybrid MPI and OpenMP Parallel Programming , 2006, PVM/MPI.

[65]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.

[66]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[67]  Dan Bonachea GASNet Specification, v1.1 , 2002 .

[68]  Nicholas Carriero,et al.  Linda and Friends , 1986, Computer.

[69]  Victor Luchangco,et al.  The Fortress Language Specification Version 1.0 , 2007 .

[70]  Jonathan Green,et al.  Multi-core and Network Aware MPI Topology Functions , 2011, EuroMPI.

[71]  Guy E. Blelloch,et al.  Vector Models for Data-Parallel Computing , 1990 .

[72]  Alejandro Duran,et al.  Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[73]  Hasan U. Akay,et al.  Hybrid Parallelism for CFD Simulations: Combining MPI with OpenMP , 2009 .

[74]  Tao Yang,et al.  Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures , 1997, J. Parallel Distributed Comput..

[75]  Laxmikant V. Kalé,et al.  Work stealing and persistence-based load balancers for iterative overdecomposed applications , 2012, HPDC '12.

[76]  Kourosh Gharachorloo,et al.  Fine-grain software distributed shared memory on SMP clusters , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[77]  David A. Bader,et al.  A novel FDTD application featuring OpenMP-MPI hybrid parallelization , 2004 .

[78]  Robert H. Halstead,et al.  Implementation of multilisp: Lisp on a multiprocessor , 1984, LFP '84.

[79]  Debra Hensgen,et al.  Two algorithms for barrier synchronization , 1988, International Journal of Parallel Programming.

[80]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[81]  Vivek Sarkar,et al.  Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement , 2009, LCPC.

[82]  Robert D. Blumofe,et al.  Scheduling multithreaded computations by work stealing , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[83]  Philippe Olivier Alexandre Navaux,et al.  Challenges and Issues of Supporting Task Parallelism in MPI , 2010, EuroMPI.

[84]  Guang R. Gao,et al.  TiNy threads: a thread virtual machine for the Cyclops64 cellular architecture , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[85]  Yi Guo,et al.  Work-first and help-first scheduling policies for async-finish task parallelism , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[86]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[87]  Thomas L. Sterling,et al.  ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications , 2009, 2009 International Conference on Parallel Processing Workshops.

[88]  Katherine Yelick,et al.  Auto-tuning stencil codes for cache-based multicore platforms , 2009 .

[89]  Sachin S. Sapatnekar,et al.  A Framework for Exploiting Task and Data Parallelism on Distributed Memory Multicomputers , 1997, IEEE Trans. Parallel Distributed Syst..

[90]  Vivek Sarkar,et al.  Data-Driven Tasks and Their Implementation , 2011, 2011 International Conference on Parallel Processing.

[91]  Emmanuel Jeannot,et al.  Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures , 2010, Euro-Par.

[92]  Vivek Sarkar,et al.  Hierarchical phasers for scalable synchronization and reductions in dynamic parallelism , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[93]  Anoop Gupta,et al.  COOL: An object-based language for parallel programming , 1994, Computer.

[94]  Bradford L. Chamberlain,et al.  Parallel Programmability and the Chapel Language , 2007, Int. J. High Perform. Comput. Appl..

[95]  Eugene D. Brooks,et al.  The butterfly barrier , 1986, International Journal of Parallel Programming.

[96]  Fiona Reid,et al.  A Microbenchmark Suite for OpenMP Tasks , 2012, IWOMP.