Paper Information - Elastic Scaling for Data Stream Processing

Elastic Scaling for Data Stream Processing

This article addresses the profitability problem associated with auto-parallelization of general-purpose distributed data stream processing applications. Auto-parallelization involves locating regions in the application's data flow graph that can be replicated at run-time to apply data partitioning and thereby achieve scale. To make auto-parallelization effective in practice, the profitability question needs to be answered: how many parallel channels provide the best throughput? The answer changes with the workload dynamics and resource availability at run-time. In this article, we propose an elastic auto-parallelization solution that dynamically adjusts the number of channels to achieve high throughput without wasting resources. Most importantly, our solution handles partitioned stateful operators via run-time state migration, which is fully transparent to application developers. We provide an implementation and evaluation of the system on an industrial-strength data stream processing platform to validate our solution.
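The profitability question in the abstract (how many parallel channels give the best throughput) can be illustrated with a small feedback loop. The sketch below is not the article's algorithm. It is a simple hill-climbing controller written under stated assumptions: throughput is sampled once per adjustment period, a relative gain below a hypothetical `tolerance` counts as "no improvement", and the names `ElasticController`, `update`, and `simulate` are invented for illustration. It also leaves out state migration and re-probing after a workload change.

```python
class ElasticController:
    """Hypothetical hill-climbing controller for the number of parallel channels."""

    def __init__(self, min_channels=1, max_channels=16, tolerance=0.05):
        self.min_channels = min_channels
        self.max_channels = max_channels
        self.tolerance = tolerance        # assumed threshold for a meaningful gain
        self.channels = min_channels      # channel count currently in use
        self.prev_channels = None         # channel count of the previous period
        self.prev_throughput = None       # throughput observed in the previous period

    def update(self, throughput):
        """Take the throughput observed at self.channels; return the next channel count."""
        if self.prev_throughput is None:
            nxt = self.channels + 1       # first period: probe one more channel
        else:
            base = max(self.prev_throughput, 1e-9)
            gain = (throughput - self.prev_throughput) / base
            went_up = self.channels > self.prev_channels
            if went_up and gain > self.tolerance:
                nxt = self.channels + 1   # the extra channel paid off, keep growing
            elif went_up:
                nxt = self.prev_channels  # no real gain: release the extra channel
            else:
                nxt = self.channels       # settled: hold the current level
        nxt = max(self.min_channels, min(self.max_channels, nxt))
        self.prev_channels, self.prev_throughput = self.channels, throughput
        self.channels = nxt
        return nxt


def simulate(throughput_fn, steps, **kwargs):
    """Drive the controller against a synthetic throughput curve; return the final count."""
    controller = ElasticController(**kwargs)
    for _ in range(steps):
        controller.update(throughput_fn(controller.channels))
    return controller.channels


if __name__ == "__main__":
    # Synthetic workload whose throughput saturates at four channels.
    print(simulate(lambda n: 100.0 * min(n, 4), steps=10))
```

With the synthetic curve above, the controller grows one channel at a time, notices that the fifth channel adds nothing, and falls back to four. A production system would also have to re-probe when the workload shifts and migrate partitioned state whenever the channel count changes, which the abstract names as the central difficulty.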
