OpenFOAM is a free, open-source CFD software package with a large user base across most areas of engineering and science. However, OpenFOAM faces severe reliability problems on high-performance computing platforms, because the mean time between failures in such systems has become quite small. The existing fault-tolerance methods in OpenFOAM typically tolerate fail-stop failures under a stop-and-rollback scheme: even when only one processor fails, the whole system has to stop and roll back to the latest checkpoints. Reliability has therefore limited the scalability of parallel simulations in OpenFOAM. Inspired by traditional sender-based message logging, we propose a novel message logging protocol that is seamlessly integrated into the OpenFOAM framework. The proposed approach uses OpenFOAM snapshots as checkpoints and disables the event logging mechanism completely, which the specific communication pattern of OpenFOAM makes possible. When a failure occurs during execution, we do not stop the whole system; instead, we replace the failed process with a newly spawned substitute process and recover it by resending logged messages. We implement the protocol in Open MPI and evaluate it with molecular dynamics simulations on a subsystem of Tianhe-1A. Experimental results show the advantage of our protocol in failure-free performance and in reduced recovery time.

Keywords: message logging; fault tolerance; OpenFOAM
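The recovery idea described above can be illustrated with a small, self-contained simulation. This is only a sketch of generic sender-based message logging, not the protocol implemented in Open MPI: the `Process` class, its methods, and the `recover` helper are hypothetical names introduced for illustration. It assumes that the order in which messages from different senders are consumed does not affect the receiver's state (which is why no event log is kept here), and that the checkpoint survives the failure. It omits details a real protocol needs, such as discarding duplicate messages that the recovering process re-sends.

```python
from collections import defaultdict


class Process:
    """Toy process whose state is a running sum of received payloads."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0
        self.received = defaultdict(int)   # sender rank -> messages consumed
        self.send_log = defaultdict(list)  # destination rank -> payloads sent
        self.checkpoint = (0, {})          # assumed to live in stable storage

    def send(self, dest, payload):
        # Sender-based logging: keep a copy of every payload at the sender.
        self.send_log[dest.rank].append(payload)
        dest.deliver(self.rank, payload)

    def deliver(self, src_rank, payload):
        self.state += payload
        self.received[src_rank] += 1

    def take_checkpoint(self):
        # Save the state together with the per-sender receive counts.
        self.checkpoint = (self.state, dict(self.received))

    def replay_to(self, dest):
        # Resend only the messages dest had not consumed at its checkpoint.
        start = dest.received[self.rank]
        for payload in self.send_log[dest.rank][start:]:
            dest.deliver(self.rank, payload)


def recover(failed, peers):
    """Build a replacement from the checkpoint and replay logged messages."""
    replacement = Process(failed.rank)
    state, received = failed.checkpoint
    replacement.state = state
    replacement.received = defaultdict(int, received)
    replacement.checkpoint = failed.checkpoint
    for peer in peers:  # the surviving processes keep running; none roll back
        peer.replay_to(replacement)
    return replacement


p0, p1, p2 = Process(0), Process(1), Process(2)
p0.send(p1, 5)
p1.take_checkpoint()
p0.send(p1, 7)
p2.send(p1, 11)
expected = p1.state                   # state of p1 just before it "fails"
recovered = recover(p1, [p0, p2])     # only p1 is restored and replayed
```

In this toy run the replacement for rank 1 reaches the pre-failure state using only its own checkpoint and the logs held by ranks 0 and 2, which is the property that lets the surviving processes avoid a global rollback.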