A replication structure for efficient and fault-tolerant parallel and distributed simulations

Large-scale parallel and distributed simulations (federations) are developed to study complex systems. Their executions are usually computationally intensive and involve a large number of simulation components (federates), which may be developed by different participants and executed at different locations. It is therefore attractive to provide mechanisms that accelerate executions and tolerate federate failures. We previously proposed a federate replication structure that improves simulation performance by replicating federates with alternative synchronization approaches and automatically choosing the fastest replica to represent the federate in the federation execution. In this paper, we extend the replication structure so that it retains its performance advantage in the presence of failures. Besides presenting the design and implementation details, we report experimental results demonstrating that the extended replication structure provides fault tolerance while maintaining its performance advantage for simulation executions.
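The idea of letting the fastest replica represent a federate, while tolerating the failure of any single replica, can be illustrated with a minimal sketch. This is not the implementation described in the paper: the `Replica` class, the `run_replicated_federate` function, and the timing values are hypothetical, and the sketch assumes that all replicas of a federate emit the same committed output events (identified by simulation timestamp and payload) and differ only in the wall-clock time at which each event becomes available.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical replica of one federate (illustrative only)."""
    name: str
    # (wall_clock_time, sim_timestamp, payload) for each committed output event
    outputs: list
    # Wall-clock time at which the replica crashes; infinity means it never fails
    fails_at: float = float("inf")


def run_replicated_federate(replicas):
    """Forward the first copy of each output event and drop later duplicates.

    Events that a replica would have produced at or after its failure time
    are never seen, so the surviving replicas cover for it.
    """
    arrivals = []
    for replica in replicas:
        for wall, sim_ts, payload in replica.outputs:
            if wall < replica.fails_at:
                arrivals.append((wall, replica.name, sim_ts, payload))
    arrivals.sort()  # process events in wall-clock arrival order

    seen = set()
    forwarded = []
    for wall, name, sim_ts, payload in arrivals:
        key = (sim_ts, payload)
        if key not in seen:  # the first replica to deliver this event wins
            seen.add(key)
            forwarded.append((sim_ts, payload, name))
    return forwarded


# Two replicas with different synchronization approaches (assumed timings).
conservative = Replica("conservative", [(1.0, 10, "a"), (4.0, 20, "b"), (5.0, 30, "c")])
optimistic = Replica("optimistic", [(2.0, 10, "a"), (3.0, 20, "b"), (9.0, 30, "c")],
                     fails_at=3.5)

result = run_replicated_federate([conservative, optimistic])
for sim_ts, payload, source in result:
    print(sim_ts, payload, "from", source)
```

In this example each event is taken from whichever replica delivers it first, and the event at simulation time 30 still reaches the federation after the optimistic replica has failed. A real replication structure would also have to handle rollbacks and incoming messages, which the sketch leaves out.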
