Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Many important "big data" applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.

[1]  Frederick Reiss,et al.  TelegraphCQ: continuous dataflow processing , 2003, SIGMOD '03.

[2]  Jennifer Widom,et al.  STREAM: the stanford stream data manager (demonstration description) , 2003, SIGMOD '03.

[3]  Ying Xing,et al.  Scalable Distributed Stream Processing , 2003, CIDR.

[4]  Jennifer Widom,et al.  STREAM: The Stanford Stream Data Manager , 2003, IEEE Data Eng. Bull..

[5]  GhemawatSanjay,et al.  The Google file system , 2003 .

[6]  Frederick Reiss,et al.  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World , 2003, CIDR.

[7]  Eric A. Brewer,et al.  Highly available, fault-tolerant, parallel dataflows , 2004, SIGMOD '04.

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Michael Stonebraker,et al.  High-availability algorithms for distributed stream processing , 2005, 21st International Conference on Data Engineering (ICDE'05).

[10]  Ying Xing,et al.  A Cooperative, Self-Configuring High-Availability Solution for Stream Processing , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[11]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[12]  Michael Stonebraker,et al.  Fault-tolerance in the borealis distributed stream processing system , 2008, ACM Trans. Database Syst..

[13]  Michael J. Franklin,et al.  Continuous Analytics: Rethinking Query Processing in a Network-Effect World , 2009, CIDR.

[14]  Jeffrey Davis,et al.  Continuous analytics over discontinuous streams , 2010, SIGMOD Conference.

[15]  Christopher Olston,et al.  Stateful bulk processing for incremental analytics , 2010, SoCC '10.

[16]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[17]  Bingsheng He,et al.  Comet: batched stream processing for data intensive distributed computing , 2010, SoCC '10.

[18]  Frank Dabek,et al.  Large-scale Incremental Processing Using Distributed Transactions and Notifications , 2010, OSDI.

[19]  Ken Yocum,et al.  In-situ MapReduce for Log Processing , 2011, USENIX Annual Technical Conference.

[20]  Mendel Rosenblum,et al.  Fast crash recovery in RAMCloud , 2011, SOSP.

[21]  M. Isard,et al.  Naiad : The Animating Spirit of Rivers and Streams , 2011 .

[22]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.