Analyzing scheduling with transient failures

When thousands of processors are used simultaneously, the application developer can no longer assume that the computing platform is failure free. The probability that one of the computing processors crashes increases drastically with the number of processors [9]. Safety issues arise not only in large-scale computing platforms but also in some real-time embedded systems [2]. Many safety models exist. One of the most popular was proposed by Shatz [11]; it represents the application as a set of tasks to schedule on processors. Under this model, faults are assumed to be transient (that is, a processor recovers immediately after a failure), and their occurrences are assumed to follow a Poisson process and to be statistically independent. The main performance index related to safety is the reliability, i.e., the probability that the application completes successfully. Reliability can be improved by a smart allocation of tasks to processors. However, such an allocation cannot improve the reliability by more than one order of magnitude. To reach a better reliability without worsening the makespan too much, replication should be used. Unfortunately, optimizing both efficiency and reliability is a difficult problem. Several algorithms have been proposed, but most are heuristics that come with no theoretical guarantees [2, 5]. Three prior works give strict guarantees on the schedules obtained. First, without allowing replication and under the fail-stop model (processors can crash and will never be operational
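As a rough illustration of the reliability model described above, the following Python sketch computes the probability that an application completes when failures are independent and follow a Poisson process. It assumes, beyond what the text states, that each processor has a constant failure rate, that a task succeeds if no failure occurs during its execution, and that a replicated task succeeds if at least one replica does. All function names and the schedule layout are hypothetical.

```python
import math


def task_reliability(failure_rate, duration):
    """Probability of no failure during `duration` under a Poisson
    process with constant rate `failure_rate` (assumed model)."""
    return math.exp(-failure_rate * duration)


def replicated_task_reliability(replicas):
    """Probability that at least one replica of a task succeeds.

    `replicas` is a list of (failure_rate, duration) pairs, one per
    processor running a copy of the task; failures are assumed
    statistically independent across processors.
    """
    p_all_fail = 1.0
    for failure_rate, duration in replicas:
        p_all_fail *= 1.0 - task_reliability(failure_rate, duration)
    return 1.0 - p_all_fail


def schedule_reliability(schedule):
    """Probability that every task completes.

    `schedule` is a list with one entry per task; each entry is the
    list of replicas of that task (hypothetical layout).
    """
    reliability = 1.0
    for replicas in schedule:
        reliability *= replicated_task_reliability(replicas)
    return reliability


if __name__ == "__main__":
    # Two tasks of duration 10 on processors with failure rate 1e-3.
    single = [[(1e-3, 10.0)], [(1e-3, 10.0)]]
    # Same tasks, each replicated on a second, identical processor.
    double = [[(1e-3, 10.0), (1e-3, 10.0)], [(1e-3, 10.0), (1e-3, 10.0)]]
    print(schedule_reliability(single))
    print(schedule_reliability(double))
```

Under these assumptions, adding a replica can only raise the reliability of a task, while the extra copies occupy processor time; this is the tension between reliability and makespan that the text refers to.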

[1] Yves Robert, et al. Optimizing latency and reliability of pipeline workflow applications, 2007, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[2] Emmanuel Jeannot, et al. Bi-objective Approximation Scheme for Makespan and Reliability Optimization on Uniform Parallel Machines, 2008, Euro-Par.

[3] Atakan Dogan, et al. Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing, 2002, IEEE Trans. Parallel Distributed Syst.

[4] R. K. Shyamasundar, et al. Introduction to algorithms, 1996.

[5] Mihalis Yannakakis, et al. On the approximability of trade-offs and optimal access of Web sources, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[6] Jean-Charles Billaut, et al. Multicriteria scheduling, 2005, Eur. J. Oper. Res.

[7] Anand Sivasubramaniam, et al. Fault-aware job scheduling for BlueGene/L systems, 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.

[8] Alain Girault, et al. A bi-criteria scheduling heuristic for distributed embedded systems under reliability and real-time constraints, 2004, International Conference on Dependable Systems and Networks, 2004.

[9] Ronald L. Graham, et al. Bounds for certain multiprocessing anomalies, 1966.

[10] Emmanuel Jeannot, et al. Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems, 2007, SPAA '07.

[11] Evripidis Bampis, et al. A FPTAS for Approximating the Unrelated Parallel Machines Scheduling Problem with Costs, 2001, ESA.

[12] J.-P. Wang, et al. Task Allocation for Maximizing Reliability of Distributed Computer Systems, 1992, IEEE Trans. Computers.