Executing Stream Joins on the Cell Processor

Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high performance. On the downside, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software-managed local memory on the co-processor side, and the unconventional programming model in general. In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of the join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally SIMD (single-instruction multiple-data) optimized operator code can be used to exploit data parallelism.
Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of 13 GB/sec) and significantly surpass the performance obtained from conventional high-end processors (supporting a combined input stream rate of 2000 tuples/sec using 15-minute windows without dropping any tuples, resulting in an 8.3 times higher output rate compared to an SSE implementation on a dual 3.2 GHz Intel Xeon).
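
To make the notions of basic windows and column-oriented window organization more concrete, the following is a minimal single-threaded sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `JoinWindow` and `window_join`, the band-join predicate, and the event tuple layout are all hypothetical, and nothing Cell-specific (DMA transfers, double buffering, SIMD, co-processor partitioning) is modeled. Each join window is kept as a queue of fixed-duration basic windows, each basic window stores its tuples as parallel columns, and expiration drops a whole basic window at a time instead of moving individual tuples.

```python
from collections import deque


class JoinWindow:
    """One side of a time-based join window, split into basic windows.

    Hypothetical illustration: each basic window holds three parallel
    columns (timestamps, join keys, payloads) rather than a list of rows.
    """

    def __init__(self, window_size, basic_size):
        self.window_size = window_size
        self.basic_size = basic_size
        # Each entry: (start_time, ts_column, key_column, payload_column)
        self.basics = deque()

    def insert(self, ts, key, payload):
        # Tuples are assumed to arrive in non-decreasing timestamp order.
        start = ts - ts % self.basic_size
        if not self.basics or self.basics[-1][0] != start:
            self.basics.append((start, [], [], []))
        _, ts_col, key_col, payload_col = self.basics[-1]
        ts_col.append(ts)
        key_col.append(key)
        payload_col.append(payload)

    def expire(self, now):
        # Drop whole basic windows once every tuple in them is out of range.
        cutoff = now - self.window_size
        while self.basics and self.basics[0][0] + self.basic_size <= cutoff:
            self.basics.popleft()

    def probe(self, now, key, band):
        # Nested-loop scan; the oldest basic window may be partially expired,
        # so each tuple's timestamp is still checked against the cutoff.
        cutoff = now - self.window_size
        matches = []
        for _, ts_col, key_col, payload_col in self.basics:
            for t, k, p in zip(ts_col, key_col, payload_col):
                if t > cutoff and abs(k - key) <= band:
                    matches.append(p)
        return matches


def window_join(events, window_size, basic_size, band=0):
    """Join two streams; events are (ts, side, key, payload), side 'L' or 'R'."""
    left = JoinWindow(window_size, basic_size)
    right = JoinWindow(window_size, basic_size)
    results = []
    for ts, side, key, payload in events:
        own, other = (left, right) if side == "L" else (right, left)
        other.expire(ts)
        for match in other.probe(ts, key, band):
            # Always emit pairs as (left payload, right payload).
            results.append((payload, match) if side == "L" else (match, payload))
        own.insert(ts, key, payload)
    return results


if __name__ == "__main__":
    events = [(0, "L", 10, "a"), (1, "R", 10, "x"),
              (5, "R", 11, "y"), (12, "L", 10, "b")]
    print(window_join(events, window_size=10, basic_size=2, band=1))
```

In this sketch the basic window size trades expiration granularity against bookkeeping: smaller basic windows are dropped sooner after their tuples expire, while larger ones mean fewer queue operations and longer contiguous columns to scan.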
