Applying GVR to Molecular Dynamics: Enabling Resilience for Scientific Computations

Resilience is a major concern for future exascale systems [1]-[4]. Molecular dynamics is an important computational method in a wide variety of areas of biology, chemistry, and physics. We applied the GVR (Global View Resilience) library to the ddcMD (domain decomposition molecular dynamics) code, both to explore application resilience challenges and to evaluate the potential for GVR to broaden and simplify application resilience. Following the earlier ddcMD code changes made to tolerate hardware-unrecoverable L1 cache parity errors [17], we replicated these recovery capabilities by adding only 310 lines of GVR library calls to the original 10,935 lines of source code. We then used this base to explore a range of application-specific error detection and recovery schemes that generalize the classes of errors that can be detected and recovered from without application interruption. This broader class includes general memory system errors (L2, L3, DRAM, bus, controller, etc.), hardware computation errors, communication errors, software bugs, and others. The error checks are conveniently expressed in the application source code in terms of application data structures, and they enable flexible, application-controlled recovery from these errors. We find that GVR enables convenient broadening of error coverage and resilience. To evaluate the capabilities of the error detection schemes, we performed error injection experiments. The results show that application-specific error detection schemes can detect errors of certain magnitudes but leave others silent. GVR provides opportunities to recover from such silent errors.
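The pairing of an application-level error check with versioned recovery can be illustrated with a minimal sketch. It is not ddcMD code and does not use the GVR API: `VersionedStore`, `state_is_plausible`, `run`, the toy harmonic-oscillator system, and the tolerance values are all hypothetical names and assumptions introduced for illustration. The check is written in terms of application data (positions, velocities, total energy), and recovery rolls back to the last committed version without stopping the run.

```python
import copy
import math


class VersionedStore:
    """Hypothetical in-memory stand-in for versioned state; NOT the GVR API."""

    def __init__(self):
        self._versions = []

    def commit(self, state):
        # Keep an independent copy so later corruption cannot reach it.
        self._versions.append(copy.deepcopy(state))

    def latest(self):
        return copy.deepcopy(self._versions[-1])


def total_energy(state, k=1.0, m=1.0):
    kinetic = sum(0.5 * m * v * v for v in state["v"])
    potential = sum(0.5 * k * x * x for x in state["x"])
    return kinetic + potential


def verlet_step(state, dt=0.01, k=1.0, m=1.0):
    # Velocity Verlet for independent harmonic oscillators (toy system).
    x, v = state["x"], state["v"]
    for i in range(len(x)):
        v_half = v[i] + 0.5 * dt * (-k * x[i] / m)
        x[i] += dt * v_half
        v[i] = v_half + 0.5 * dt * (-k * x[i] / m)


def state_is_plausible(state, e_ref, rel_tol=1e-3):
    # Application-specific check: finite values and bounded energy drift.
    if not all(math.isfinite(val) for val in state["x"] + state["v"]):
        return False
    return abs(total_energy(state) - e_ref) <= rel_tol * abs(e_ref)


def run(n_steps, inject_at=None, magnitude=5.0, check_every=10):
    state = {"x": [1.0, -0.5, 0.25], "v": [0.0, 0.3, -0.1]}
    e_ref = total_energy(state)
    store = VersionedStore()
    store.commit(state)
    recoveries = 0
    step = 0
    while step < n_steps:
        verlet_step(state)
        step += 1
        if step == inject_at:
            state["x"][0] += magnitude  # simulated transient corruption
            inject_at = None            # transient: does not recur on replay
        if step % check_every == 0:
            if state_is_plausible(state, e_ref):
                store.commit(state)
            else:
                # Roll back to the last good version and replay the interval.
                state = store.latest()
                step -= check_every
                recoveries += 1
    return state, recoveries
```

In this sketch, calling `run(100, inject_at=25)` injects a large error that the energy check catches at the next checkpoint, so the run rolls back and replays. A call such as `run(100, inject_at=25, magnitude=1e-9)` injects a perturbation too small for the tolerance to catch, so it passes silently, which mirrors the observation that such checks detect only errors of certain magnitudes.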