Leveraging near data processing for high-performance checkpoint/restart

With the increasing size of HPC systems, the system mean time to interrupt will decrease. This requires checkpoints to be stored in less time when checkpoint/restart (C/R) is used for mitigation. Multilevel checkpointing improves C/R efficiency by saving most checkpoints to fast compute-node local storage, but it still incurs a high cost for writing a few checkpoints to slow global I/O. We show that leveraging near data processing (NDP) to offload the writing of checkpoints to global I/O improves C/R efficiency. We explore additional opportunities to use NDP to further reduce C/R overhead and, as a starting point, evaluate checkpoint compression using NDP. We evaluate the performance of our novel application of NDP for C/R and compare it to existing C/R optimizations. Our evaluation for a projected exascale system using multilevel checkpointing shows that, with NDP, the host processor increases its efficiency on average from 51% to 78% (i.e., a >50% speedup in performance).
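
The step from the two efficiency figures to the quoted speedup can be checked with a short calculation. The sketch below assumes that efficiency means the fraction of wall-clock time the host spends on useful computation, so that the speedup is the ratio of the two efficiencies. That definition and the function name speedup_from_efficiency are illustrative assumptions, not taken from the paper.

```python
def speedup_from_efficiency(before: float, after: float) -> float:
    """Return the relative throughput gain when efficiency rises.

    Assumption: efficiency is the fraction of wall-clock time spent on
    useful work, so useful work per unit time scales linearly with it.
    """
    if not (0.0 < before <= 1.0 and 0.0 < after <= 1.0):
        raise ValueError("efficiencies must lie in (0, 1]")
    return after / before


# Figures quoted in the abstract: 51% without NDP, 78% with NDP.
speedup = speedup_from_efficiency(0.51, 0.78)
print(f"{speedup:.2f}x")  # prints "1.53x"
```

Under this assumption the ratio is about 1.53, which is consistent with the ">50% speedup" stated above.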
