Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications

In modern Stream Processing Engines (SPEs), numerous diverse applications, which can differ in aspects such as cost, criticality or latency sensitivity, can co-exist on the same computing node. When these differences must be taken into account to control the performance of each application, custom scheduling of operators to threads is of key importance (e.g., when a smart vehicle needs to ensure that safety-critical applications always have access to computational power, while other applications are given lower, variable priorities). Many schedulers have been proposed that allocate threads to operators to optimize specific metrics (e.g., latency), but there is still a lack of a tool that allows arbitrarily complex scheduling strategies to be seamlessly plugged on top of an SPE. We propose Haren to fill this gap. More specifically, we (1) formalize the thread scheduling problem in stream processing in a general way that allows ad-hoc scheduling policies to be defined, (2) identify the bottlenecks and the opportunities of scheduling in stream processing, (3) distill a compact interface to connect Haren with SPEs, enabling rapid testing of various scheduling policies, (4) illustrate the usability of the framework by integrating it into an actual SPE and (5) provide a thorough evaluation. As we show, Haren makes it possible to adapt the use of computational resources over time to meet the goals of a variety of scheduling policies.
