Paper information - Leveraging near data processing for high-performance checkpoint/restart

Leveraging near data processing for high-performance checkpoint/restart

As HPC systems grow in size, the system mean time to interrupt will decrease. When checkpoint/restart (C/R) is used for mitigation, checkpoints must therefore be stored in less time. Multilevel checkpointing improves C/R efficiency by saving most checkpoints to fast compute-node local storage, but it incurs a high cost for writing a few checkpoints to slow global I/O. We show that leveraging near data processing (NDP) to offload the writing of checkpoints to global I/O improves C/R efficiency. We explore additional opportunities for using NDP to further reduce C/R overhead and, as a starting point, evaluate checkpoint compression using NDP. We evaluate the performance of our novel application of NDP for C/R and compare it to existing C/R optimizations. Our evaluation for a projected exascale system using multilevel checkpointing shows that with NDP, the host processor increases its efficiency on average from 51% to 78% (i.e., a >50% speedup in performance).
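The speedup quoted at the end of the abstract can be checked from the two efficiency figures. The sketch below assumes that "efficiency" means the fraction of wall-clock time the host spends on useful computation, so that work completed per unit time scales linearly with it; that reading, and the helper name `speedup_from_efficiency`, are illustrative assumptions rather than definitions taken from the paper.

```python
def speedup_from_efficiency(baseline: float, improved: float) -> float:
    """Relative throughput gain when efficiency rises from baseline to improved.

    Assumes useful work per unit time is proportional to efficiency
    (an assumption made for this sketch, not a definition from the paper).
    """
    if not (0.0 < baseline <= 1.0 and 0.0 < improved <= 1.0):
        raise ValueError("efficiencies must lie in (0, 1]")
    return improved / baseline


# Figures from the abstract: 51% without NDP, 78% with NDP.
gain = speedup_from_efficiency(0.51, 0.78)
print(f"speedup: {gain:.2f}x ({(gain - 1) * 100:.0f}% faster)")
# prints "speedup: 1.53x (53% faster)"
```

Under that assumption the ratio is about 1.53, which is consistent with the abstract's ">50% speedup" claim.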

Gabriel H. Loh | James Tuck | Abhinav Agrawal
