Design and Implementation of a Novel Message Logging Protocol for OpenFOAM

OpenFOAM is a free, open-source CFD software package with a large user base across most areas of engineering and science. Unfortunately, OpenFOAM faces severe reliability problems when running on high-performance computing platforms, because the mean time between failures in such systems has become quite short. The existing fault-tolerance method in OpenFOAM tolerates fail-stop failures under a stop-and-rollback scheme: even if only one processor fails, the whole system has to stop and roll back to the latest checkpoint. Reliability has therefore limited the scalability of parallel simulations on OpenFOAM. Inspired by traditional sender-based message logging, we propose in this paper a novel message logging protocol that is seamlessly integrated into the OpenFOAM framework. The proposed approach uses the snapshots of OpenFOAM as checkpoints and disables the event logging mechanism completely, which the specific communication pattern of OpenFOAM makes possible. When a failure occurs during execution, we do not stop the whole system; instead, we replace the failed process with a newly spawned substitute process and recover it by resending logged messages. We implement the protocol in Open MPI and evaluate it with molecular dynamics simulations on a subsystem of Tianhe-1A. Experimental results show the advantages of our protocol in failure-free performance and in reduced recovery time.

Keywords: message logging; fault tolerance; OpenFOAM
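
The recovery idea summarized above can be illustrated with a toy model. The following is only a minimal sketch of generic sender-based message logging, not the protocol implemented in Open MPI: the `Process` class, the `recover` function, and all field names are hypothetical, and the sketch assumes that each receiver consumes messages from a given sender in FIFO order, so that no event log of delivery order is needed. Each sender keeps a copy of every message it sends; a replacement process is restored from the last checkpoint, and its peers resend their logged messages, with already delivered messages dropped by sequence number.

```python
from collections import defaultdict


class Process:
    """Toy process that sums received values (all names are hypothetical)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0                      # application state: running sum
        self.send_seq = defaultdict(int)    # next sequence number per destination
        self.recv_seq = defaultdict(int)    # messages delivered so far per source
        self.send_log = defaultdict(list)   # sender-side log: dest -> [(seq, payload)]
        self.checkpoint = (0, {})           # (state, recv_seq) at the last snapshot

    def send(self, dest, payload):
        seq = self.send_seq[dest.rank]
        self.send_seq[dest.rank] += 1
        # Sender-based logging: keep the payload in the sender's own memory.
        self.send_log[dest.rank].append((seq, payload))
        dest.deliver(self.rank, seq, payload)

    def deliver(self, src, seq, payload):
        if seq < self.recv_seq[src]:
            return                          # already reflected in the state: drop
        self.recv_seq[src] = seq + 1
        self.state += payload

    def take_checkpoint(self):
        self.checkpoint = (self.state, dict(self.recv_seq))

    def resend_to(self, dest):
        # Replay every logged message for this destination in send order.
        for seq, payload in self.send_log[dest.rank]:
            dest.deliver(self.rank, seq, payload)


def recover(failed, peers):
    """Build a substitute from the failed rank's checkpoint and replay peer logs."""
    state, recv_seq = failed.checkpoint
    substitute = Process(failed.rank)
    substitute.state = state
    substitute.recv_seq = defaultdict(int, recv_seq)
    substitute.checkpoint = (state, dict(recv_seq))
    for peer in peers:
        peer.resend_to(substitute)
    return substitute


# Usage: rank 1 fails after receiving two messages past its checkpoint.
p0, p1 = Process(0), Process(1)
p0.send(p1, 5)
p1.take_checkpoint()
p0.send(p1, 7)
p0.send(p1, 11)
state_before_failure = p1.state
p1 = recover(p1, [p0])                      # rank 0 keeps running throughout
assert p1.state == state_before_failure
```

In this sketch only the failed rank is rebuilt; rank 0 never rolls back, which is the property that distinguishes message logging from a stop-and-rollback scheme. A real implementation would also have to restore the substitute's own send counters and discard log entries that precede a completed checkpoint, both of which are omitted here.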