Data-Driven Concurrency for High Performance Computing

In this work, we use dynamic dataflow (data-driven) techniques to improve the performance of high performance computing (HPC) systems. The techniques are implemented and evaluated in an efficient, portable, and robust programming framework that enables data-driven concurrency on HPC systems. The framework is based on data-driven multithreading (DDM), a hybrid control-flow/dataflow model that schedules threads based on data availability on sequential processors. We evaluated the framework using several benchmarks with different characteristics on two systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The evaluation shows that the framework scales well and tolerates scheduling overheads and memory latencies effectively. We also compare the framework with MPI, DDM-VM, and OmpSs@Cluster; the results show that it obtains comparable or better performance.
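To illustrate what scheduling "based on data availability" can mean, the following is a minimal single-process sketch in Python. It is not the framework's implementation: the function `run_data_driven`, the `tasks` and `deps` dictionaries, and the ready-count bookkeeping are all hypothetical names chosen for this example. Each task carries a counter of producers it still waits on, and a task becomes runnable only when that counter reaches zero.

```python
from collections import deque


def run_data_driven(tasks, deps):
    """Run tasks in an order driven purely by data availability.

    tasks: dict mapping a task name to a callable (hypothetical API).
    deps:  dict mapping a task name to the list of producer task names
           whose results are passed to it as positional arguments.
    Returns (results, order), where order is the execution sequence.
    """
    # Ready count: number of producers that have not yet delivered data.
    ready_count = {name: len(deps.get(name, [])) for name in tasks}

    # Reverse map: which consumers must be notified when a task finishes.
    consumers = {name: [] for name in tasks}
    for name, producers in deps.items():
        for producer in producers:
            consumers[producer].append(name)

    # Tasks with no pending inputs are immediately runnable.
    ready = deque(name for name in tasks if ready_count[name] == 0)
    results, order = {}, []

    while ready:
        name = ready.popleft()
        inputs = [results[p] for p in deps.get(name, [])]
        results[name] = tasks[name](*inputs)
        order.append(name)
        # Completing a task decrements the ready count of each consumer;
        # a consumer is scheduled only once all of its inputs exist.
        for consumer in consumers[name]:
            ready_count[consumer] -= 1
            if ready_count[consumer] == 0:
                ready.append(consumer)

    return results, order


# Usage: "sum" runs only after both "a" and "b" have produced values,
# and "square" runs only after "sum".
tasks = {
    "a": lambda: 2,
    "b": lambda: 3,
    "sum": lambda x, y: x + y,
    "square": lambda s: s * s,
}
deps = {"sum": ["a", "b"], "square": ["sum"]}
results, order = run_data_driven(tasks, deps)
print(results["square"], order)
```

The sketch runs tasks one at a time. In a real data-driven runtime the ready queue would feed multiple worker threads or nodes, but the rule shown here, that a task is scheduled only when all of its input data is available, is the same.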
