Supporting fault-tolerance for time-critical events in distributed environments

In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting a time deadline. Our first contribution is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success-rate, which is the probability of finishing execution without failures. For the cases where failures do occur, we have developed a hybrid failure-recovery scheme to ensure that the application can still complete within the pre-specified time interval. Our experimental results show that our scheduling algorithm achieves higher benefit than several heuristics-based greedy scheduling algorithms while incurring negligible overhead. Benefit improves further when we apply the hybrid failure-recovery scheme, and the success-rate becomes 100%.
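The trade-off between benefit and success-rate can be illustrated with a small sketch. It is not the scheduling algorithm of this paper: it enumerates every assignment exhaustively and keeps the Pareto-optimal ones. The resource names, speeds, failure rates, the benefit model (work completed by the deadline), and the independent exponential failure model are all assumptions made for illustration.

```python
import math
from itertools import product

# Hypothetical resources: (name, speed, failure_rate_per_hour). Values are made up.
RESOURCES = [
    ("fast_flaky", 4.0, 0.20),
    ("mid", 2.0, 0.05),
    ("slow_stable", 1.0, 0.01),
]

def evaluate(assignment, deadline_hours):
    """Return (benefit, success_rate) for an assignment of one resource per task."""
    # Assumed benefit model: total work completed by the deadline.
    benefit = sum(speed * deadline_hours for _, speed, _ in assignment)
    # Success-rate: probability that no assigned resource fails before the
    # deadline, assuming independent exponentially distributed failures.
    total_rate = sum(rate for _, _, rate in assignment)
    success_rate = math.exp(-total_rate * deadline_hours)
    return benefit, success_rate

def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(num_tasks, deadline_hours):
    """Enumerate all assignments and keep those not dominated by any other."""
    candidates = [
        (evaluate(a, deadline_hours), tuple(r[0] for r in a))
        for a in product(RESOURCES, repeat=num_tasks)
    ]
    return [
        c for c in candidates
        if not any(dominates(other[0], c[0]) for other in candidates)
    ]

if __name__ == "__main__":
    for (benefit, success_rate), names in pareto_front(num_tasks=2, deadline_hours=1.0):
        print(f"{names}: benefit={benefit:.1f}, success_rate={success_rate:.3f}")
```

Exhaustive enumeration is only workable for tiny instances, since the number of assignments grows exponentially with the number of tasks. A practical scheduler would need a heuristic or metaheuristic search over the same two objectives.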
