An Introduction to DF-Threads and their Execution Model

Current computing systems are mostly focused on achieving performance, programmability, energy efficiency, and resiliency by essentially replicating the uni-core execution model n times in parallel on a multi-/many-core system. This choice has heavily conditioned the way both software and hardware are designed today. However, the concept of dataflow is as old as computer architecture itself: "initiating an activity in presence of the data it needs to perform its function" [J. Dennis]. Dataflow was historically first explored at the instruction level, and its major practical outcome is today's superscalar processors, which implement a form of "restricted dataflow" at that level. In this paper, we illustrate the idea of using the dataflow concept to define novel thread types that we call Data-Flow-Threads, or DF-Threads. The advantages we aim at concern several aspects that are not yet fully explored: i) isolating computations so that communication patterns can be managed more efficiently by a relatively simple architecture; ii) the possibility of re-executing a thread when a fault affecting its resources is detected; iii) a minimalistic low-level API that allows compilers and programmers to map their parallel codes and allows architects to implement more efficient and scalable systems. The semantics of DF-Threads is tightly connected to their execution model, which we illustrate here. Several other efforts have pursued similar goals, from the introduction of macro-dataflow to the more recent DF-Codelets and the OCR project. In our case, we aim at a more complete model that delivers the above advantages and, in particular, manages mutable shared state by relying on transactional-memory semantics. Our initial experiments show how to map some simple kernels and indicate the scalability potential on a futuristic 1000-core many-core system.
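To give a concrete flavor of the firing rule behind DF-Threads, the short C sketch below emulates it sequentially in a single address space: a thread is created with a synchronization count equal to the number of inputs it still waits for, producers write into its frame and decrement that count, and the thread body runs only once the count reaches zero. The names df_schedule and df_write, the frame layout, and the fixed number of input slots are illustrative assumptions for this sketch, not the actual DF-Thread API.

    /* Minimal, hypothetical sketch of the DF-Thread firing rule: a thread
     * becomes executable only when its synchronization count (the number of
     * inputs it still waits for) drops to zero. Names and layout are
     * illustrative assumptions, not the paper's API. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct df_frame {
        void   (*body)(struct df_frame *);  /* code run when inputs are ready */
        int64_t  sc;                        /* synchronization count: pending inputs */
        int64_t  in[4];                     /* input slots written by producers */
    } df_frame;

    /* Create a frame for a thread that waits for 'sc' inputs. */
    static df_frame *df_schedule(void (*body)(df_frame *), int64_t sc) {
        df_frame *f = calloc(1, sizeof *f);
        f->body = body;
        f->sc   = sc;
        return f;
    }

    /* Producer side: write one input into the consumer's frame and decrement
     * its synchronization count; fire the thread when the count reaches zero. */
    static void df_write(df_frame *f, int slot, int64_t value) {
        f->in[slot] = value;
        if (--f->sc == 0) {        /* all inputs available: thread is ready */
            f->body(f);
            free(f);
        }
    }

    /* Consumer thread: adds its two inputs once both have been produced. */
    static void adder(df_frame *self) {
        printf("sum = %lld\n", (long long)(self->in[0] + self->in[1]));
    }

    int main(void) {
        df_frame *t = df_schedule(adder, 2);  /* thread waits for 2 inputs */
        df_write(t, 0, 40);                   /* first producer */
        df_write(t, 1, 2);                    /* second producer: fires 'adder' */
        return 0;
    }

In a real many-core runtime the ready thread would be dispatched to a core and its frame would live in thread-local storage rather than being executed inline; the sketch only captures the data-driven firing rule, not the scheduling, fault-recovery, or transactional-memory aspects discussed above.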
