Elastic Scaling for Data Stream Processing

This article addresses the profitability problem associated with auto-parallelization of general-purpose distributed data stream processing applications. Auto-parallelization involves locating regions in the application's data flow graph that can be replicated at run-time to apply data partitioning and thereby achieve scale. To make auto-parallelization effective in practice, the profitability question needs to be answered: How many parallel channels provide the best throughput? The answer changes with the workload dynamics and resource availability at run-time. In this article, we propose an elastic auto-parallelization solution that dynamically adjusts the number of channels used, achieving high throughput without wasting resources. Most importantly, our solution handles partitioned stateful operators via run-time state migration, which is fully transparent to application developers. We provide an implementation and evaluation of the system on an industrial-strength data stream processing platform to validate our solution.
