A Distributed Workflow Mapping Algorithm for Minimum End-to-End Delay under Fault-Tolerance Constraint

Many large-scale scientific applications feature distributed computing workflows of complex structures that must be executed and transferred in shared wide-area networks consisting of unreliable nodes and links. Mapping these computing workflows in such faulty network environments for optimal latency while ensuring certain fault tolerance is crucial to the success of eScience that requires both performance and reliability. We construct analytical cost models and formulate workflow mapping as an optimization problem under failure rate constraint. We propose a distributed heuristic mapping solution based on recursive critical path to achieve minimum end-to-end delay and satisfy a pre-specified overall failure rate for a guaranteed level of fault tolerance. The performance superiority of the proposed mapping solution is illustrated by extensive simulation-based comparisons with existing mapping algorithms.

[1]  George Papageorgiou,et al.  Scheduling Dags to Minimize Time and Communication , 1988, AWOC.

[2]  Rajkumar Buyya,et al.  A Dynamic Critical Path Algorithm for Scheduling Scientific Workflow Applications on Global Grids , 2007, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007).

[3]  Chase Qishi Wu,et al.  Supporting Distributed Application Workflows in Heterogeneous Computing Environments , 2008, 2008 14th IEEE International Conference on Parallel and Distributed Systems.

[4]  Chase Qishi Wu,et al.  Optimizing network performance of computing pipelines in distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[5]  C. Siva Ram Murthy,et al.  A state-space search approach for optimizing reliability and cost of execution in distributed sensor networks , 2009, J. Parallel Distributed Comput..

[6]  Ian J. Taylor,et al.  Workflows and e-Science: An overview of workflow system features and capabilities , 2009, Future Gener. Comput. Syst..

[7]  Salim Hariri,et al.  Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..

[8]  Umakishore Ramachandran,et al.  Streamline: a scheduling heuristic for streaming applications on the grid , 2006, Electronic Imaging.

[9]  Edward A. Lee,et al.  A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures , 1993, IEEE Trans. Parallel Distributed Syst..

[10]  Ladislau Bölöni,et al.  A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems , 2001, J. Parallel Distributed Comput..

[11]  Rajkumar Buyya,et al.  A Decentralized and Cooperative Workflow Scheduling Algorithm , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[12]  Füsun Özgüner,et al.  Dynamic, competitive scheduling of multiple DAGs in a distributed heterogeneous environment , 1998, Proceedings Seventh Heterogeneous Computing Workshop (HCW'98).

[13]  Dharma P. Agrawal,et al.  Improving scheduling of tasks in a heterogeneous environment , 2004, IEEE Transactions on Parallel and Distributed Systems.

[14]  C. Siva Ram Murthy,et al.  Task Allocation Algorithms for Maximizing Reliability of Distributed Computing Systems , 1997, IEEE Trans. Computers.

[15]  Yves Robert,et al.  Optimizing the Latency of Streaming Applications under Throughput and Reliability Constraints , 2009, ICPP.

[16]  Yves Robert,et al.  Mapping pipeline skeletons onto heterogeneous platforms , 2007, J. Parallel Distributed Comput..

[17]  Jeffrey D. Ullman,et al.  NP-Complete Scheduling Problems , 1975, J. Comput. Syst. Sci..

[18]  Ishfaq Ahmad,et al.  Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors , 1996, IEEE Trans. Parallel Distributed Syst..

[19]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[20]  J.-P. Wang,et al.  Task Allocation for Maximizing Reliability of Distributed Computer Systems , 1992, IEEE Trans. Computers.

[21]  Atakan Dogan,et al.  Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..