Megaphone: Latency-conscious state migration for distributed streaming dataflows

We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can be prepared ahead of time to avoid runtime coordination. Megaphone is implemented as a library on an unmodified timely dataflow implementation, and provides an operator interface compatible with its existing APIs. We evaluate Megaphone on established benchmarks with varying amounts of state and observe that compared to naive approaches Megaphone reduces service latencies during reconfiguration by orders of magnitude without significantly increasing steady-state overhead.

[1]  Bernhard Mitschang,et al.  ProRea: live database migration for multi-tenant RDBMS with snapshot isolation , 2013, EDBT '13.

[2]  Frank McSherry,et al.  Latency-conscious dataflow reconfiguration , 2018, BeyondMR@SIGMOD.

[3]  Craig Chambers,et al.  The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing , 2015, Proc. VLDB Endow..

[4]  Elke A. Rundensteiner,et al.  Dynamic plan migration for continuous queries over data streams , 2004, SIGMOD '04.

[5]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[6]  Tsuyoshi Murata,et al.  {m , 1934, ACML.

[7]  Daniel Mills,et al.  MillWheel: Fault-Tolerant Stream Processing at Internet Scale , 2013, Proc. VLDB Endow..

[8]  Seif Haridi,et al.  Apache Flink™: Stream and Batch Processing in a Single Engine , 2015, IEEE Data Eng. Bull..

[9]  Reynold Xin,et al.  Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark , 2018, SIGMOD Conference.

[10]  Divyakant Agrawal,et al.  Zephyr: live migration in shared nothing databases for elastic cloud platforms , 2011, SIGMOD '11.

[11]  Seif Haridi,et al.  State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing , 2017, Proc. VLDB Endow..

[12]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[13]  Aoying Zhou,et al.  Parallel Stream Processing Against Workload Skewness and Variance , 2017, HPDC.

[14]  Prashant J. Shenoy,et al.  "Cut me some slack": latency-aware live migration for databases , 2012, EDBT '12.

[15]  Divyakant Agrawal,et al.  Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases , 2015, SIGMOD Conference.

[16]  Claudio Soriente,et al.  StreamCloud: An Elastic and Scalable Data Streaming System , 2012, IEEE Transactions on Parallel and Distributed Systems.

[17]  Kian-Lee Tan,et al.  ChronoStream: Elastic stateful stream computation in the cloud , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[18]  Raul Castro Fernandez,et al.  Integrating scale out and fault tolerance in stream processing using operator state management , 2013, SIGMOD '13.

[19]  Paolo Costa,et al.  Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems , 2018, Proc. VLDB Endow..

[20]  Thomas S. Heinze,et al.  Latency-aware elastic scaling for distributed data stream processing systems , 2014, DEBS '14.

[21]  Vasiliki Kalavri,et al.  Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows , 2018, OSDI.

[22]  Peter A. Tucker,et al.  NEXMark – A Benchmark for Queries over Data Streams DRAFT , 2002 .

[23]  M. Abadi,et al.  Naiad: a timely dataflow system , 2013, SOSP.

[24]  Divyakant Agrawal,et al.  Albatross: Lightweight Elasticity in Shared Storage Databases for the Cloud using Live Data Migration , 2011, Proc. VLDB Endow..

[25]  Joseph M. Hellerstein,et al.  Flux: an adaptive partitioning operator for continuous query systems , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[26]  Martín Abadi,et al.  Formal Analysis of a Distributed Algorithm for Tracking Progress , 2013, FMOODS/FORTE.

[27]  Sriram Rao,et al.  Dhalion: Self-Regulating Stream Processing in Heron , 2017, Proc. VLDB Endow..

[28]  Weng-Fai Wong,et al.  Gloss: Seamless Live Reconfiguration and Reoptimization of Stream Programs , 2018, ASPLOS.