Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation

In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, system performance can be assessed by the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework that analytically characterizes the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system of heterogeneous nodes with stochastic service and failure times and a communication network that imposes random, non-negligible delays. It also permits the individual nodes to take arbitrarily specified, distributed load-balancing actions to improve the service reliability. The analysis is based on a novel use of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is then used to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.
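To make the definition of service reliability concrete, the following minimal sketch estimates it by Monte Carlo simulation for a deliberately simplified setting and compares the estimate with a closed form. The simplifications are assumptions of this sketch, not the paper's model: service and failure times are exponential, nodes never exchange tasks (no load balancing, so network delays play no role), and a node's unserved tasks are lost when it fails. Under these assumptions a node with m queued tasks, service rate s, and failure rate f serves all of its tasks before failing with probability (s / (s + f))^m, and the system reliability is the product over nodes. All function and parameter names are hypothetical.

```python
import random


def closed_form_reliability(queues, service_rates, failure_rates):
    """Product over nodes of (s / (s + f)) ** m.

    Valid only under the sketch's assumptions: exponential service and
    failure times, and no task transfers between nodes.
    """
    reliability = 1.0
    for m, s, f in zip(queues, service_rates, failure_rates):
        reliability *= (s / (s + f)) ** m
    return reliability


def simulate_once(queues, service_rates, failure_rates, rng):
    """Return True if every node serves its whole queue before it fails."""
    for m, s, f in zip(queues, service_rates, failure_rates):
        failure_time = rng.expovariate(f)  # permanent failure instant
        busy_time = sum(rng.expovariate(s) for _ in range(m))  # time to drain queue
        if busy_time > failure_time:
            return False  # node failed with tasks still queued
    return True


def monte_carlo_reliability(queues, service_rates, failure_rates,
                            trials=20000, seed=0):
    """Fraction of simulated runs in which all queued tasks are served."""
    rng = random.Random(seed)
    successes = sum(
        simulate_once(queues, service_rates, failure_rates, rng)
        for _ in range(trials)
    )
    return successes / trials


if __name__ == "__main__":
    # Illustrative (made-up) parameters for a two-node system.
    queues = [5, 3]
    service_rates = [2.0, 1.0]
    failure_rates = [0.1, 0.05]
    print("closed form :", closed_form_reliability(queues, service_rates, failure_rates))
    print("monte carlo :", monte_carlo_reliability(queues, service_rates, failure_rates))
```

The two printed values should agree up to sampling error. Once load balancing and random transfer delays are introduced, this product form no longer applies, which is the situation the regeneration-based analysis described in the abstract is meant to handle.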
