Stream processing platforms for analyzing big dynamic data

Abstract Nowadays, data is produced in every aspect of our lives, leading to a massive amount of information generated every second. However, this vast amount is often too large to be stored and for many applications the information contained in these data streams is only useful when it is fresh. Batch processing platforms like Hadoop MapReduce do not fit these needs as they require to collect data on disk and process it repeatedly. Therefore, modern data processing engines combine the scalability of distributed architectures with the one-pass semantics of traditional stream engines. In this paper, we survey the current state of the art in scalable stream processing from a user perspective. We examine and describe their architecture, execution model, programming interface, and data analysis support as well as discuss the challenges and limitations of their APIs. In this connection, we introduce Piglet, an extended Pig Latin language and code generator that compiles (extended) Pig Latin code into programs for various data processing platforms. Thereby, we discuss the mapping to platform-specic concepts in order to provide a uniform view.

[1]  Bugra Gedik Partitioning functions for stateful data parallelism in stream processing , 2013, The VLDB Journal.

[2]  Johannes Gehrke,et al.  Distributed event stream processing with non-deterministic finite automata , 2009, DEBS '09.

[3]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[4]  Alessandro Margara,et al.  Processing flows of information: From data stream to complex event processing , 2012, CSUR.

[5]  Craig Chambers,et al.  The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing , 2015, Proc. VLDB Endow..

[6]  Seif Haridi,et al.  Lightweight Asynchronous Snapshots for Distributed Dataflows , 2015, ArXiv.

[7]  Alain Biem,et al.  IBM infosphere streams for scalable, real-time, intelligent transportation services , 2010, SIGMOD Conference.

[8]  Daniel Mills,et al.  MillWheel: Fault-Tolerant Stream Processing at Internet Scale , 2013, Proc. VLDB Endow..

[9]  Kun-Lung Wu,et al.  Elastic Scaling for Data Stream Processing , 2014, IEEE Transactions on Parallel and Distributed Systems.

[10]  Martin Hirzel,et al.  Partition and compose: parallel complex event processing , 2012, DEBS.

[11]  Gianmarco De Francisci Morales SAMOA: a platform for mining big data streams , 2013, WWW '13 Companion.

[12]  Peter R. Pietzuch,et al.  Distributed complex event processing with query rewriting , 2009, DEBS '09.

[13]  Joseph M. Hellerstein,et al.  Flux: an adaptive partitioning operator for continuous query systems , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[14]  Scott Shenker,et al.  Discretized streams: fault-tolerant streaming computation at scale , 2013, SOSP.

[15]  Ying Xing,et al.  The Design of the Borealis Stream Processing Engine , 2005, CIDR.

[16]  Scott Shenker,et al.  Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters , 2012, HotCloud.

[17]  Tore Risch,et al.  Massive scale-out of expensive continuous queries , 2011, Proc. VLDB Endow..

[18]  Katja Hose,et al.  SPARQling Pig - Processing Linked Data with Pig Latin , 2015, BTW.

[19]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[20]  Taghi M. Khoshgoftaar,et al.  A survey of open source tools for machine learning with big data in the Hadoop ecosystem , 2015, Journal of Big Data.

[21]  Qiang Chen,et al.  Aurora : a new model and architecture for data stream management ) , 2006 .

[22]  Gianmarco De Francisci Morales,et al.  Big Data Stream Learning with SAMOA , 2014, 2014 IEEE International Conference on Data Mining Workshop.

[23]  Jennifer Widom,et al.  STREAM: The Stanford Data Stream Management System , 2016, Data Stream Management.

[24]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[25]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.