Task allocation for distributed stream processing

There is a growing demand for live, on-the-fly processing of increasingly large amounts of data. To ensure the timely and reliable processing of streaming data, a variety of distributed stream processing architectures and platforms have been developed. These handle two fundamental jobs: (dynamically) assigning processing tasks to the currently available physical resources, and routing streaming data between these resources. However, while plenty of platforms offer such functionality, the theory behind it is not well understood. In particular, it is unclear how best to allocate the processing tasks to the given resources. In this paper, we establish a theoretical foundation by formally defining a task allocation problem for distributed stream processing, which we prove to be NP-hard. Furthermore, we propose an approximation algorithm for the class of series-parallel decomposable graphs, which captures a broad range of common stream processing applications. The algorithm achieves a constant-factor approximation under two assumptions: the number of resources scales at least logarithmically with the number of computational tasks, and the computational cost of the tasks dominates the cost of communication.
