Tutorial: cloud-based data stream processing

In this tutorial we present the results of recent research on the cloud enablement of data streaming systems. Drawing on both industrial and academic prototypes, we illustrate emerging use cases and research trends. Specifically, we focus on novel approaches for (1) scalability and (2) fault tolerance in large-scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust while introducing less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
