A resilient scheduler for dataflow execution

As processor manufacturers shifted to chips with an ever-increasing number of cores, giving average programmers a practical way to exploit parallelism became imperative. The scientific community is on a quest to create programming models that make it easier to describe tasks and the interactions between them. At the same time, as the number of cores increases, so does the chance of a fault occurring in a core, so it is also important to make these programming models resilient. DFER was shown to be a good fit for taking advantage of dataflow programming while introducing resiliency to transient faults during dataflow task execution. However, although most of the computing time of a dataflow system is spent in task execution, it is also desirable to provide fault tolerance in scheduling operations. This paper introduces novel techniques that add a level of resiliency to the dataflow task scheduler in DFER. Experiments with two different approaches for achieving resiliency in the scheduler show promising results that take DFER one step further towards reliability.
