Toward data-driven architectural support in improving the performance of future HPC architectures

Abstract We propose architectures based on Data-Driven Multithreading (DDM), a hybrid control-flow/data-flow model, to address the concurrency challenges faced by future High-Performance Computing (HPC) systems. We focus on the design and implementation of an optimized hardware Thread Scheduling Unit (TSU) and its integration into a multi-core system dubbed MiDAS. The TSU is the core of the DDM model: it orchestrates the execution of multiple threads on sequential processors based on data availability. MiDAS was prototyped on a Xilinx Virtex-6 FPGA and extensively evaluated using several micro-benchmarks, showing that performance grows linearly as the processing core count increases, even when running benchmarks with very small problem sizes. Under the largest problem size tested and with all 8 available cores in use, MiDAS achieves an average speedup of 7.91×, exhibiting 98.8% utilization efficiency. Further results pertaining to the proposed hardware TSU are provided, including FPGA real estate requirements: MiDAS's TSU incurs relatively small overheads and reduced power consumption, and the various TSU operations respond with low latency. To support these claims, the proposed DDM-based TSU is compared with the Task Superscalar architecture, which implements the StarSs programming framework in hardware. The comparison shows that the proposed TSU requires considerably less hardware and energy to operate. Specifically, Task Superscalar is found to be 4.94× larger than the DDM-supporting TSU in terms of slice register requirements and 11.34× larger with respect to the slice look-up table count. Finally, the hardware TSU is compared with a software TSU implementation offering identical functionality, with both run on an FPGA fabric under a synthetic application; a detailed performance evaluation shows that MiDAS's hardware-implemented TSU significantly outperforms its software-based counterpart.
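
The idea of scheduling threads "based on data availability" can be illustrated with a small software model. The sketch below is not the MiDAS TSU or any published DDM interface; the class name, method names, and the ready-count bookkeeping are assumptions made purely for illustration. Each thread carries a count of producers it is still waiting on, and it becomes runnable only when that count reaches zero.

```python
from collections import deque


class ThreadSchedulingUnit:
    """Toy, hypothetical model of data-driven scheduling via ready counts."""

    def __init__(self):
        self.ready_count = {}  # thread id -> number of producers still pending
        self.consumers = {}    # thread id -> ids of threads that depend on it
        self.bodies = {}       # thread id -> callable holding the thread's code
        self.ready_queue = deque()

    def add_thread(self, tid, body, depends_on=()):
        """Register a thread and the producer threads whose data it needs."""
        self.bodies[tid] = body
        self.ready_count[tid] = len(depends_on)
        self.consumers.setdefault(tid, [])
        for producer in depends_on:
            self.consumers.setdefault(producer, []).append(tid)

    def run(self):
        """Execute threads as their inputs become available; return the order."""
        order = []
        # Threads with no pending producers are runnable immediately.
        self.ready_queue.extend(
            tid for tid, pending in self.ready_count.items() if pending == 0
        )
        while self.ready_queue:
            tid = self.ready_queue.popleft()
            self.bodies[tid]()
            order.append(tid)
            # Completion of a producer decrements each consumer's ready count.
            for consumer in self.consumers[tid]:
                self.ready_count[consumer] -= 1
                if self.ready_count[consumer] == 0:
                    self.ready_queue.append(consumer)
        return order


# Usage: "multiply" runs only after both of its producers have finished.
results = {}
tsu = ThreadSchedulingUnit()
tsu.add_thread("load_a", lambda: results.update(a=2))
tsu.add_thread("load_b", lambda: results.update(b=3))
tsu.add_thread(
    "multiply",
    lambda: results.update(c=results["a"] * results["b"]),
    depends_on=("load_a", "load_b"),
)
order = tsu.run()
```

This model runs threads one at a time on a single queue; a hardware unit serving several cores would dispatch ready threads to cores concurrently, which the sketch does not attempt to capture.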
