A Three-Phases Byzantine Fault Tolerance Mechanism for HLA-Based Simulation

A large-scale HLA-based simulation (federation) is composed of many simulation components (federates), which may be developed by different participants and executed at different locations. Byzantine failures, caused by malicious attacks or by software and hardware bugs, may occur in federates and propagate through the federation execution. In this paper, a three-phase Byzantine Fault Tolerance (BFT) mechanism, consisting of failure detection, failure location, and failure recovery, is proposed based on the decoupled federate architecture. By combining replication, checkpointing, and message logging techniques, some redundant executions of federate replicas are avoided. The BFT mechanism is implemented using both Barrier and No-Barrier federate replication structures. Protocols are also developed to remove the epidemic effect caused by Byzantine failures. Experimental results show that the BFT mechanism using No-Barrier replication significantly outperforms the one using Barrier replication when federate replicas differ in runtime performance.
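The abstract does not spell out how the detection and location phases work, so the following is only a generic illustration of how replica outputs might be compared, not the authors' protocol. It assumes each federate replica emits one hashable output per step and that a strict majority of replicas is correct; the function name `detect_and_locate` and the replica identifiers are hypothetical. Recovery (for example, restoring a checkpoint and replaying logged messages) is not shown.

```python
from collections import Counter


def detect_and_locate(outputs):
    """Compare one step's outputs from several replicas of a federate.

    outputs: dict mapping a replica id to that replica's output (hashable).
    Returns (failure_detected, agreed_value, suspects).
    Assumption: a strict majority of replicas is correct.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]

    if votes <= len(outputs) // 2:
        # No strict majority: a failure is detected, but voting alone
        # cannot say which replica is faulty, so all remain suspects.
        return True, None, sorted(outputs)

    # Replicas that disagree with the majority value are the suspects.
    suspects = sorted(r for r, v in outputs.items() if v != value)
    return bool(suspects), value, suspects


if __name__ == "__main__":
    step_outputs = {"replica1": "msg-A", "replica2": "msg-A", "replica3": "msg-B"}
    print(detect_and_locate(step_outputs))
```

In this sketch a disagreement signals detection and the minority set gives the location; a real mechanism would also have to handle replicas that are slow or silent rather than wrong.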
