Tuning of the Checkpointing and Communication Library for optimistic simulation on Myrinet based NOWs

Recently, a Checkpointing and Communication Library (CCL) for optimistic simulation on Myrinet-based networks of workstations (NOWs) has been presented. CCL offloads checkpoint operations from the CPU by charging them to a programmable DMA engine on the Myrinet network card. CCL also includes functionality for freezing the simulation application on demand, which can be used to maintain data consistency (for example, when a state buffer must be modified while a DMA-based checkpoint operation involving it is still in progress). Programming the DMA engine to perform a checkpoint operation by transferring large data blocks in a single burst keeps the latency of the checkpoint operation low, which reduces the probability that the application actually has to be frozen. On the other hand, transferring large data blocks in a single burst might interfere with communication, since the DMA engine (and other circuitry) cannot be used for communication until the data transfer currently in progress has completed. In this paper we present a detailed identification of the effects of the burst length, from which we outline a set of relevant phenomena to take into account when choosing a suitable compile-time value for the burst length. We also report measurements quantifying these phenomena for the case of a PC cluster. The data indicate that communication does not suffer from the use of non-minimal burst lengths for checkpoint operations, which shows that CCL, if well tuned, provides highly effective, CPU-offloaded checkpointing.
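The burst-length trade-off described above can be illustrated with a deliberately simple cost model. This is only a sketch under stated assumptions, not the model or the measurements used in the paper: the function name, its parameters, the fixed per-burst setup cost, and the linear transfer time are all hypothetical. It shows that a longer burst means fewer bursts and a lower total checkpoint latency, but a longer worst-case interval during which the DMA engine is unavailable for communication.

```python
def burst_tradeoff(state_bytes, burst_bytes, dma_bytes_per_us, per_burst_overhead_us):
    """Toy model of checkpointing one state buffer through a DMA engine.

    All parameters are hypothetical and chosen for illustration only:
      state_bytes           -- size of the state buffer to checkpoint
      burst_bytes           -- burst length (assumed fixed at compile time)
      dma_bytes_per_us      -- assumed sustained DMA bandwidth
      per_burst_overhead_us -- assumed fixed cost to program each burst
    Returns (number of bursts, checkpoint latency in us, worst-case
    time in us that communication may wait for the DMA engine).
    """
    if state_bytes <= 0 or burst_bytes <= 0 or dma_bytes_per_us <= 0:
        raise ValueError("sizes and bandwidth must be positive")

    # Integer ceiling division: the last burst may be shorter than the rest.
    num_bursts = (state_bytes + burst_bytes - 1) // burst_bytes

    # Total latency: one setup cost per burst plus the raw transfer time.
    checkpoint_latency_us = (num_bursts * per_burst_overhead_us
                             + state_bytes / dma_bytes_per_us)

    # Communication can be blocked for at most one whole burst.
    largest_burst = min(burst_bytes, state_bytes)
    max_comm_stall_us = per_burst_overhead_us + largest_burst / dma_bytes_per_us

    return num_bursts, checkpoint_latency_us, max_comm_stall_us


if __name__ == "__main__":
    # Compare several burst lengths for a 4 KiB state buffer (made-up values).
    for burst in (256, 1024, 4096):
        print(burst, burst_tradeoff(4096, burst, 100.0, 2.0))
```

In this toy model the worst-case stall grows with the burst length, whereas the measurements reported for the PC cluster indicate that communication does not suffer in practice from non-minimal burst lengths.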
