Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

In high-performance computing environments, input/output (I/O) operations from various sources often contend for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model and then derive a set of constraints necessary to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
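For readers unfamiliar with the Young/Daly interval mentioned above, the following minimal sketch computes the classical first-order period, sqrt(2 * C * mu), for a single application, together with the usual first-order waste estimate C/T + T/(2 * mu). The function names, the sample values, and the assumption that contention simply doubles the checkpoint cost are illustrative only; they are not the model derived in this work, and restart and downtime costs are ignored.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpoint period: sqrt(2 * C * mu).

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost: C/T (checkpointing) + T/(2*mu) (re-execution).

    Ignores restart and downtime costs; valid only when T is small compared to mu.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    mtbf = 3000.0  # hypothetical mean time between failures, in seconds

    # Application checkpointing alone: hypothetical checkpoint cost of 60 s.
    alone = young_daly_period(60.0, mtbf)
    print(alone, first_order_waste(alone, 60.0, mtbf))

    # Illustrative assumption: I/O contention doubles the checkpoint cost to 120 s.
    shared = young_daly_period(120.0, mtbf)
    print(shared, first_order_waste(shared, 120.0, mtbf))
```

Under these assumed numbers, the single-application period is 600 s with a first-order waste of 0.2, and a doubled checkpoint cost raises both the period and the waste. This only hints at why a per-application optimum may not account for bandwidth shared across the platform.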
