AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, which makes a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we assume that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement when compared to state-of-the-art checkpointing approaches, all this achievable with an extra memory requirement of less than 5% of the total application memory.
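The flush-ordering idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `flush_order`, the `past_first_write` map (page number to the position of its first write in the previous checkpoint interval), and the `waiting_now` set (pages the application is currently blocked on) are all hypothetical names introduced here. The sketch assumes only what the abstract states, namely that pages likely to be written first should reach stable storage first.

```python
def flush_order(dirty_pages, past_first_write, waiting_now=()):
    """Return dirty pages in the order they would be flushed (illustrative only)."""
    waiting = set(waiting_now)
    unseen = float("inf")  # pages with no recorded history sort last

    def priority(page):
        # 1. Current trend: pages the application is waiting on right now go first.
        # 2. Past trend: pages first written earliest in the previous interval.
        # 3. Ties are broken by page number to keep the order deterministic.
        return (page not in waiting, past_first_write.get(page, unseen), page)

    return sorted(dirty_pages, key=priority)


# Hypothetical example: page 12 was written first in the last interval,
# and the application is currently blocked on page 11.
dirty = [10, 11, 12, 13]
history = {12: 0, 13: 2, 10: 5}
print(flush_order(dirty, history, waiting_now=[11]))
```

Flushing in this order is meant to reduce the chance that the application writes to a page that has not yet been saved, which is the interference the abstract refers to.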
