Scheduling linear chain streaming applications on heterogeneous systems with failures

In this paper, we study the problem of optimizing the throughput of streaming applications for heterogeneous platforms subject to failures. Applications are linear graphs of tasks (pipelines), with a type associated with each task. The challenge is to map each task onto one machine of a target platform. Every machine is initially able to process all task types, but to avoid costly setups each machine must be specialized to process only one task type. Each task instance can thus be performed by any machine specialized in its type, and the workload of the system can be shared among a set of specialized machines. The objective is to maximize the throughput, i.e., the rate at which jobs can be processed when accounting for failures. For identical machines, we prove that an optimal solution can be computed in polynomial time. However, the problem becomes NP-hard when two machines may process the same task type at different speeds. Several polynomial-time heuristics are designed for the most realistic specialized settings. Simulation results assess their efficiency, showing that the best heuristics obtain a good throughput, much better than the throughput obtained with a random mapping. Moreover, this throughput is close to the optimal throughput in the particular cases where the optimum can be computed.
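To make the objective concrete, the following is a minimal sketch of how the throughput of a specialized mapping might be evaluated. It is not the model or algorithm of the paper. It assumes a hypothetical failure model in which task i loses a job with a fixed probability, identical machines of a common speed, and a workload shared evenly among the machines specialized in the same type. All names (`expected_inputs`, `throughput`, `machines_per_type`) are illustrative.

```python
from collections import defaultdict


def expected_inputs(failure_probs):
    """Expected number of jobs each task must start so that one job leaves the chain.

    Assumption: task i loses a job with probability failure_probs[i],
    independently of the other tasks.
    """
    x = [0.0] * len(failure_probs)
    needed = 1.0  # one finished job must come out of the last task
    for i in range(len(failure_probs) - 1, -1, -1):
        needed /= 1.0 - failure_probs[i]
        x[i] = needed
    return x


def throughput(task_types, work, failure_probs, machines_per_type, speed=1.0):
    """Finished jobs per time unit for a specialized mapping (illustrative model).

    Assumption: all machines run at the same speed, and the load of a type
    is split evenly among the machines specialized in that type.
    """
    x = expected_inputs(failure_probs)
    load = defaultdict(float)  # work per finished job, grouped by task type
    for task_type, w, xi in zip(task_types, work, x):
        load[task_type] += xi * w
    # The slowest group of specialized machines bounds the whole pipeline.
    period = max(load[t] / (machines_per_type[t] * speed) for t in load)
    return 1.0 / period


# Three-task pipeline with types a, b, a; two machines for type a, one for type b.
rate = throughput(
    task_types=["a", "b", "a"],
    work=[1.0, 2.0, 1.0],
    failure_probs=[0.5, 0.0, 0.5],
    machines_per_type={"a": 2, "b": 1},
)
print(rate)  # 0.25
```

In this example the first task must start four jobs for each finished one, so type a carries a load of 6 time units per finished job and type b a load of 4. The single machine of type b is the bottleneck, which gives a throughput of 0.25 jobs per time unit.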
