Towards Easy-to-Use Checkpointing of MPI Applications within CLUSTERIX

Although many kernel-level and user-level libraries and systems support checkpointing running processes and resuming their execution, providing an easy-to-use tool for checkpointing parallel applications remains difficult. In this work, we aim to develop an easy-to-use, user-guided library that supports checkpointing of parallel MPI applications executed within the CLUSTERIX environment, i.e., a collection of distributed HPC clusters. We propose a programmer-assisted approach in which process state is packed and unpacked at the code level for SPMD HPC applications. Although the library is at an early stage of development, we present checkpoint/restart times and the execution times of an application interrupted by checkpointing for the proposed approach, and compare them with those of the same application linked with the ckpt user-level library.
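To make the idea of code-level state packing and unpacking concrete, the following is a minimal sketch in plain C. It is not the API of the library described here: the type `app_state` and the functions `pack_state`, `unpack_state`, `save_checkpoint`, and `load_checkpoint` are hypothetical names chosen for illustration, and the example assumes that a process's state consists of only a loop counter and a partial result.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical state of one SPMD process (an assumption for illustration). */
typedef struct {
    int    iteration;   /* next loop iteration to execute */
    double partial_sum; /* result accumulated so far */
} app_state;

/* Number of bytes the packed state occupies. */
#define PACKED_SIZE (sizeof(int) + sizeof(double))

/* Pack the state field by field into a flat buffer; returns bytes written. */
size_t pack_state(const app_state *s, unsigned char *buf, size_t cap) {
    assert(cap >= PACKED_SIZE);
    memcpy(buf, &s->iteration, sizeof s->iteration);
    memcpy(buf + sizeof s->iteration, &s->partial_sum, sizeof s->partial_sum);
    return PACKED_SIZE;
}

/* Unpack the state from a buffer produced by pack_state. */
void unpack_state(app_state *s, const unsigned char *buf) {
    memcpy(&s->iteration, buf, sizeof s->iteration);
    memcpy(&s->partial_sum, buf + sizeof s->iteration, sizeof s->partial_sum);
}

/* Write a checkpoint file; returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const app_state *s) {
    unsigned char buf[PACKED_SIZE];
    size_t n = pack_state(s, buf, sizeof buf);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(buf, 1, n, f);
    if (fclose(f) != 0 || written != n) return -1;
    return 0;
}

/* Read a checkpoint file; returns 0 on success, -1 on failure. */
int load_checkpoint(const char *path, app_state *s) {
    unsigned char buf[PACKED_SIZE];
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (got != sizeof buf) return -1;
    unpack_state(s, buf);
    return 0;
}
```

In an MPI program, each process would presumably call `save_checkpoint` at a point chosen by the programmer, using a file name that includes its rank as obtained from `MPI_Comm_rank`, and would call `load_checkpoint` at startup to resume from the saved iteration. The packed layout in this sketch is only portable between processes with the same type sizes and byte order.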