FT-Offload: A Scalable Fault-Tolerance Programming Model on MIC Cluster

Massively heterogeneous architectures are popular for modern petascale and future exascale systems. Fault tolerance is essential given the growing number of components and the complexity of these heterogeneous systems. However, standard offload programming models have traditionally been developed for high performance rather than reliability, and naive fault-tolerance protocols cannot serve distributed MPI applications that are tuned for CPU-MIC heterogeneous clusters. To address these problems, we design and implement FT-Offload, a framework for a fault-tolerant programming model. It enhances the reliability of heterogeneous supercomputers while retaining the convenience of the popular Intel Offload programming model. The effectiveness of the framework is demonstrated via numerical analysis and by porting both benchmarks and real-world applications to large-scale CPU-MIC nodes on the Tianhe-2 supercomputer. Our experimental results show that the current checkpoint-based solution can efficiently strengthen long-running executions and reduce checkpointing overhead.
