Parallelizing stateful operators in a distributed stream processing system: how, should you and how much?

We consider a distributed stream processing application, expressed as a data-flow graph with operators as vertices connected by streams and deployed over a cluster of compute nodes, in which a small subset of the operators is often the performance bottleneck for the entire application. When a bottleneck operator is stateless, parallelization can clearly help improve performance: the incoming stream is split among multiple parallel operators deployed on different nodes. The benefit is less obvious when the bottleneck operator is stateful. In that case, parallelization is much more challenging, as it often requires a state-sharing mechanism for the parallel operators. Moreover, it incurs additional overheads, because the parallel operators must access shared state and synchronization constructs. In this paper, we propose a parallelization framework for stateful stream processing operators. The framework not only addresses the system model and support for operator parallelization, but also develops a theoretical model of when parallelization is suitable and what degree of parallelism is optimal. We have implemented and evaluated our framework in the context of IBM's System S distributed stream processing middleware. Microbenchmarks are used to validate the proposed theoretical model, and a parallelized implementation of a moving k-nearest-neighbors (KNN) application is used for evaluation.
