Compiler Techniques for Massively Scalable Implicit Task Parallelism

Swift/T is a high-level language for writing concise, deterministic scripts that compose serial or parallel codes, implemented in lower-level programming models, into large-scale parallel applications. It executes using a data-driven task-parallel execution model capable of orchestrating millions of concurrently executing asynchronous tasks on homogeneous or heterogeneous resources. Producing code that executes efficiently at this scale requires sophisticated compiler transformations: poorly optimized code inhibits scaling through excessive synchronization and communication. We present a comprehensive set of compiler techniques for data-driven task parallelism, including novel compiler optimizations and intermediate representations. We report application benchmark studies, including unbalanced tree search and simulated annealing, and demonstrate that our techniques greatly reduce communication overhead and enable extreme scalability, distributing up to 612 million dynamically load-balanced tasks per second at scales of up to 262,144 cores without explicit parallelism, synchronization, or load balancing in application code.
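To make the data-driven task-parallel model concrete, the following is a minimal illustrative sketch, not code from Swift/T or its runtime: single-assignment futures and a scheduler in which a task becomes runnable only once all of its input futures have been written. All names (`Future`, `Scheduler`, `task`, `write`) are hypothetical; a real engine such as Turbine additionally distributes and load-balances these tasks across distributed-memory workers.

```python
# Illustrative sketch (not the Swift/T implementation): data-driven tasks
# over single-assignment futures. A task fires when its last input arrives.

class Future:
    """A single-assignment variable that tasks can wait on."""
    def __init__(self):
        self.value = None
        self.is_set = False
        self.waiters = []   # tasks blocked on this future

class Scheduler:
    def __init__(self):
        self.ready = []     # tasks whose inputs are all available

    def task(self, fn, inputs, output):
        """Register fn to run once every input future is set."""
        t = {"fn": fn, "inputs": inputs, "output": output,
             "pending": sum(1 for f in inputs if not f.is_set)}
        if t["pending"] == 0:
            self.ready.append(t)
        else:
            for f in inputs:
                if not f.is_set:
                    f.waiters.append(t)
        return output

    def write(self, fut, value):
        """Single-assignment write; releases tasks whose last input this is."""
        fut.value, fut.is_set = value, True
        for t in fut.waiters:
            t["pending"] -= 1
            if t["pending"] == 0:
                self.ready.append(t)
        fut.waiters = []

    def run(self):
        """Drain the ready queue; a distributed engine would instead
        hand these tasks to remote workers with dynamic load balancing."""
        while self.ready:
            t = self.ready.pop()
            result = t["fn"](*[f.value for f in t["inputs"]])
            self.write(t["output"], result)

sched = Scheduler()
a, b = Future(), Future()
s = sched.task(lambda x, y: x + y, [a, b], Future())   # s depends on a, b
d = sched.task(lambda x: 2 * x, [s], Future())         # d depends on s
sched.write(a, 3)
sched.write(b, 4)
sched.run()
print(d.value)  # 14
```

Note that the script above never states an execution order: the schedule is derived entirely from data dependences, which is what allows the compiler and runtime to exploit all available task parallelism, and also why the optimizations the paper presents (reducing the synchronization and communication attached to each future) matter at scale.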
