Iterative Scheduling for Distributed Stream Processing Systems

Nowadays data stream processing systems need to efficiently handle large volumes of data in near real-time. To achieve this, the schedulers within such systems minimise the data movement between highly communicating tasks, improving system throughput. However, finding an optimal schedule for these systems is NP-hard. In this research, we propose a heuristic scheduling algorithm which reliably and efficiently finds the highly communicating tasks by exploiting graph partitioning algorithms and a mathematical optimisation software package. We evaluate our scheduler with two popular existing schedulers R-Storm and Aniello et al.'s 'Online scheduler' using two real-world applications and show that our proposed scheduler outperforms R-Storm, increasing throughput by between 3% and 30% and Online scheduler by 20--86% as a result of finding a more efficient schedule.