Modeling and optimization of non-blocking checkpointing for optimistic simulation on Myrinet clusters

The Checkpointing and Communication Library (CCL) is a recently developed software library that implements CPU-offloaded checkpointing in support of optimistic parallel simulation on Myrinet clusters. Specifically, CCL performs the memory-to-memory data copy associated with checkpoint operations in non-blocking mode, using the data transfer capabilities of a programmable DMA engine on board the Myrinet network cards. CPU and DMA activities must sometimes be re-synchronized, for example to maintain data consistency, which adds overhead to non-blocking checkpoint operations that would otherwise cost no CPU time. In this paper we present a cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantics, which we call minimum cost re-synchronization (<em>MC</em>). Under this semantics, each re-synchronization either commits the ongoing DMA-based checkpoint operation (which suspends CPU activities) or aborts it (which may increase the expected rollback cost, since fewer checkpoints are committed), whichever the cost model predicts to have the lower expected overhead. We have implemented <em>MC</em> within CCL, and we report experimental results showing that this optimized re-synchronization semantics increases execution speed for a Personal Communication System (PCS) simulation application.
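The commit-or-abort choice made at a re-synchronization point can be illustrated with a minimal sketch. The abstract does not give the cost model itself, so the two cost terms below are placeholder assumptions: the commit cost is taken to be the CPU time spent waiting for the in-flight DMA copy, and the abort cost is taken to be the expected extra state-reconstruction time on a later rollback. All names (`ResyncContext`, `commit_cost`, `abort_cost`, `resynchronize`) are hypothetical and are not part of the CCL API.

```python
from dataclasses import dataclass


@dataclass
class ResyncContext:
    """Hypothetical inputs to the decision; not taken from the paper."""
    remaining_dma_time: float      # CPU wait if the in-flight checkpoint is committed
    rollback_probability: float    # assumed chance a rollback would use this checkpoint
    extra_recovery_time: float     # assumed extra recovery time if the checkpoint is missing


def commit_cost(ctx: ResyncContext) -> float:
    # Committing suspends the CPU until the DMA copy finishes.
    return ctx.remaining_dma_time


def abort_cost(ctx: ResyncContext) -> float:
    # Aborting costs nothing now, but raises the expected rollback cost.
    return ctx.rollback_probability * ctx.extra_recovery_time


def resynchronize(ctx: ResyncContext) -> str:
    """Pick the action with the lower expected overhead (ties commit)."""
    return "commit" if commit_cost(ctx) <= abort_cost(ctx) else "abort"


if __name__ == "__main__":
    nearly_done = ResyncContext(remaining_dma_time=2.0,
                                rollback_probability=0.1,
                                extra_recovery_time=50.0)
    just_started = ResyncContext(remaining_dma_time=10.0,
                                 rollback_probability=0.1,
                                 extra_recovery_time=50.0)
    print(resynchronize(nearly_done))
    print(resynchronize(just_started))
```

In this sketch a checkpoint whose DMA transfer is almost complete tends to be committed, while one that has barely started tends to be aborted; the actual model in the paper may weigh these costs differently.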
