Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations

Because of their distributed, large-scale nature, Grid applications have in recent years been prone to failures and malicious attacks during execution. The application itself, however, has limited means to address these problems. This paper presents the design, implementation, and evaluation of Dynasa, an adaptive framework that strives to handle security problems by applying adaptive fault tolerance (i.e., checkpointing and replication) during application execution, according to the status of the Grid environment. We evaluate the framework experimentally on the Grid5000 testbed, and the results demonstrate that Dynasa enables the application itself to handle security problems efficiently. Starting the adaptive component takes less than 1 s, and an adaptive action takes less than 0.1 s with a checkpoint interval of 20 s. Compared with a non-adaptive method, Dynasa achieves better performance in terms of execution time, network bandwidth consumed, and CPU load, resulting in up to 50% lower overhead.
