Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of long-running, tightly coupled applications, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, rapidly growing checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes, regardless of whether they run on the same node or on different nodes. We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in a way that minimizes I/O bottlenecks. Large-scale experiments show a significant reduction of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real-life high performance computing application.
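The core idea of cross-process page deduplication can be illustrated with a minimal sketch. The abstract does not describe the actual collective algorithm, so the code below is only a naive, centralized illustration under stated assumptions: a 4 KiB page size, SHA-1 page fingerprints, and a "lowest rank stores the page" ownership rule. The names `split_pages`, `deduplicate`, and `restore` are hypothetical and are not taken from the paper.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (not specified in the abstract)


def split_pages(memory: bytes, page_size: int = PAGE_SIZE):
    """Split a process memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


def deduplicate(process_memories):
    """Naive global deduplication across processes.

    process_memories maps a rank to its memory image (bytes).
    Returns (to_write, layout, store):
      to_write[rank] -> fingerprints of pages this rank must save itself
      layout[rank]   -> ordered fingerprints needed to rebuild the rank
      store          -> fingerprint -> page contents (stands in for storage)
    """
    to_write = {rank: [] for rank in process_memories}
    layout = {rank: [] for rank in process_memories}
    store = {}
    for rank in sorted(process_memories):
        for page in split_pages(process_memories[rank]):
            fingerprint = hashlib.sha1(page).hexdigest()
            if fingerprint not in store:
                # First rank to present this page becomes its owner (assumption).
                store[fingerprint] = page
                to_write[rank].append(fingerprint)
            layout[rank].append(fingerprint)
    return to_write, layout, store


def restore(rank, layout, store):
    """Rebuild the memory image of a rank from the deduplicated store."""
    return b"".join(store[fp] for fp in layout[rank])


# Three simulated processes that share some identical pages.
page_a = b"A" * PAGE_SIZE
page_b = b"B" * PAGE_SIZE
page_c = b"C" * PAGE_SIZE
memories = {0: page_a + page_b, 1: page_a + page_c, 2: page_a + page_b}

to_write, layout, store = deduplicate(memories)
total_pages = sum(len(pages) for pages in layout.values())
print(f"pages before dedup: {total_pages}, after: {len(store)}")
```

In this toy input, six pages collapse to three unique ones, and each process can still be rebuilt from `layout` and `store`. A scalable scheme would replace the central loop with a distributed exchange of fingerprints among processes, which this sketch does not attempt.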
