A scalable thread scheduling co-processor based on data-flow principles

Large synchronization and communication overheads will become a major concern in future extreme-scale machines (e.g., HPC systems, supercomputers). These systems will push performance limits upward by adopting chips with an order of magnitude more cores than today's. Alternative execution models can be explored to exploit the high parallelism offered by such massive many-core chips. This paper proposes integrating standard cores with dedicated co-processing units that enable the system to support a fine-grain data-flow execution model developed within the TERAFLUX project. An instruction set architecture (ISA) extension for fine-grain thread scheduling and execution is proposed. The extension is implemented by the co-processor, which provides hardware units that accelerate thread scheduling and distribution among the available cores. Two fundamental aspects underlie the proposed system: programmers can adopt their preferred programming model, and the compilation tools can produce a large set of threads that communicate mainly in a producer-consumer fashion, hence enabling data-flow execution. Experimental results demonstrate the feasibility of the proposed approach and its ability to scale with an increasing number of cores.

We present a data-flow based co-processor supporting the execution of fine-grain threads. We propose a minimalistic core ISA extension for data-flow threads. We propose a two-level hierarchical scheduling co-processor that implements the ISA extension. We show the scalability of the proposed system through a set of experimental results.
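The producer-consumer style of data-flow thread scheduling described above can be illustrated with a small software sketch. It is not the paper's ISA extension or co-processor design: the names `DFThread`, `Scheduler`, `schedule`, `write`, and `run` are hypothetical, the scheduler is single-level rather than hierarchical, and cores are assigned round-robin purely for illustration. The only idea it is meant to show is that a thread becomes ready once all of its inputs have been written by its producers.

```python
from collections import deque


class DFThread:
    """A hypothetical data-flow thread: a body plus a frame of input slots."""

    def __init__(self, name, body, n_inputs):
        self.name = name
        self.body = body
        self.frame = [None] * n_inputs  # input slots filled by producers
        self.pending = n_inputs         # inputs still missing


class Scheduler:
    """Single-level software stand-in for a scheduling unit (assumption)."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.ready = deque()   # threads whose inputs have all arrived
        self.trace = []        # (thread name, core id) in dispatch order
        self._next_core = 0

    def schedule(self, name, body, n_inputs):
        """Create a thread; it is ready at once if it needs no inputs."""
        thread = DFThread(name, body, n_inputs)
        if thread.pending == 0:
            self.ready.append(thread)
        return thread

    def write(self, consumer, slot, value):
        """Producer stores one input (each slot is assumed written once)."""
        consumer.frame[slot] = value
        consumer.pending -= 1
        if consumer.pending == 0:
            self.ready.append(consumer)

    def run(self):
        """Dispatch ready threads to cores in round-robin order."""
        while self.ready:
            thread = self.ready.popleft()
            core = self._next_core
            self._next_core = (self._next_core + 1) % self.n_cores
            self.trace.append((thread.name, core))
            thread.body(self, thread.frame)


def run_example():
    """Two producers feed one consumer that adds their values."""
    sched = Scheduler(n_cores=2)
    results = []
    consumer = sched.schedule(
        "sum", lambda s, f: results.append(f[0] + f[1]), n_inputs=2
    )
    sched.schedule("prod_a", lambda s, f: s.write(consumer, 0, 3), n_inputs=0)
    sched.schedule("prod_b", lambda s, f: s.write(consumer, 1, 4), n_inputs=0)
    sched.run()
    return results, sched.trace
```

In this sketch the consumer never polls or blocks: it is enqueued only when its pending-input counter reaches zero, which is the property that lets a scheduler dispatch threads without lock-based synchronization between them.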
