A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets

In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our technique is automatic and does not require modification or recompilation of the OS, the compiler, or the application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
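The abstract does not spell out the selection rule, so the following is only an illustrative sketch of how a runtime could pick tasks to replicate against a reliability target. It is not the App_FIT algorithm itself. It assumes that each task carries an estimated failure-rate contribution (the hypothetical field `fit`), that the target is expressed as a budget on the summed contribution of unreplicated tasks (`fit_target`), and that a replicated task contributes nothing to that sum. The names `Task` and `select_for_replication` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    fit: float  # assumed per-task failure-rate estimate (hypothetical units)


def select_for_replication(tasks, fit_target):
    """Decide, in creation order, which tasks to replicate.

    Assumption: a task may run unreplicated only while the accumulated
    failure-rate estimate of all unreplicated tasks stays within
    fit_target; otherwise the task is replicated and adds nothing
    to the accumulated estimate.
    """
    unprotected_fit = 0.0
    replicated = []
    for task in tasks:
        if unprotected_fit + task.fit <= fit_target:
            # Still within budget: run this task without a replica.
            unprotected_fit += task.fit
        else:
            # Budget would be exceeded: replicate this task instead.
            replicated.append(task.name)
    return replicated, unprotected_fit


if __name__ == "__main__":
    tasks = [Task("a", 3.0), Task("b", 5.0), Task("c", 2.0), Task("d", 4.0)]
    replicated, residual = select_for_replication(tasks, fit_target=6.0)
    print(replicated, residual)
```

Under these assumptions, only the tasks that would push the unprotected total past the target are replicated, which is the general sense in which selective replication can meet a target without replicating every task.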
