PartLy: learning data partitioning for distributed data stream processing

Data partitioning plays a critical role in data stream processing. Current data partitioning techniques use simple, static heuristics that do not incorporate feedback about the quality of their partitioning decisions (i.e., a fire-and-forget strategy). Hence, the data partitioner often makes the same decision repeatedly, regardless of how well it worked. In this paper, we argue that reinforcement learning techniques can be applied to address this problem, and that artificial neural networks can facilitate the learning of efficient partitioning policies. We identify the challenges that emerge when applying machine learning techniques to the data partitioning problem for distributed data stream processing. Furthermore, we introduce PartLy, a proof-of-concept data partitioner, and present preliminary results that indicate PartLy's potential to match the performance of state-of-the-art techniques in terms of partitioning quality while minimizing storage and processing overheads.
