VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, because the I/O throughput of external storage is limited, global checkpointing often leads to I/O bottlenecks. To address this issue, applications are increasingly shifting from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background). However, with rising core counts per node and growing heterogeneity of both local and external storage, designing efficient asynchronous checkpointing mechanisms is nontrivial because of the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood, yet it is highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about which local storage devices to use, in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
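
To make the device-selection idea more concrete, the following is a minimal sketch of how a performance model and monitored counters might be combined to pick a local storage device. It is not the VeloC implementation: the `Device` fields, the equal-share bandwidth model in `predicted_bw`, and the `choose_device` helper are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical description of one node-local storage device."""
    name: str
    peak_bw: float            # modeled peak write bandwidth in bytes/s (assumed calibrated offline)
    free_bytes: int           # remaining capacity in bytes
    active_writers: int = 0   # monitored: processes currently writing checkpoints here
    flush_bw: float = 0.0     # monitored: bandwidth currently used by background flushes


def predicted_bw(dev: Device) -> float:
    """Toy model (an assumption): the bandwidth left over after background
    flushes is shared equally among current writers plus the new one."""
    available = max(dev.peak_bw - dev.flush_bw, 0.0)
    return available / (dev.active_writers + 1)


def choose_device(devices: List[Device], size: int) -> Optional[Device]:
    """Return the device with the highest predicted per-writer bandwidth
    among those with enough free space, or None if the checkpoint fits nowhere."""
    candidates = [d for d in devices if d.free_bytes >= size]
    if not candidates:
        return None  # caller must wait for a flush or fall back to another path
    return max(candidates, key=predicted_bw)
```

In this sketch, a heavily shared fast device can lose to a slower but idle one, which is the kind of concurrency-aware trade-off the abstract describes. A real system would replace the equal-share model with measured behavior under contention.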
