PanJoin: A Partition-based Adaptive Stream Join

Stream join is one of the critical sources of performance bottlenecks in stream processing. A sliding-window-based stream join provides precise results but consumes considerable computational resources. Current solutions lack support for join predicates on large windows: existing algorithms and their hardware accelerators are either limited to equi-join or use a nested-loop join to process all requests. In this paper, we present a new algorithm called PanJoin, which achieves high throughput on large windows and supports both equi-join and non-equi-join. PanJoin implements three new data structures to reduce computation during the probing phase of stream join. We also implement the most hardware-friendly of these data structures, called BI-Sort, on an FPGA. Our evaluation shows that PanJoin outperforms several recently proposed stream join methods by more than 1000x, and that it adapts well to highly skewed data.
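To make the problem setting concrete, the following is a minimal sketch of the nested-loop sliding-window join that the abstract describes as the fallback in existing work. It is not PanJoin and does not use any of its data structures. The function name `window_join`, the stream labels `"R"` and `"S"`, and the count-based window are assumptions chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (stream label, join key); labels "R"/"S" are assumed


def window_join(
    events: Iterable[Event],
    window_size: int,
    predicate: Callable[[int, int], bool],
) -> List[Tuple[int, int]]:
    """Baseline nested-loop join over two count-based sliding windows.

    Each arriving tuple probes every tuple in the opposite window, so the
    cost per tuple grows linearly with the window size.
    """
    # deque(maxlen=...) evicts the oldest tuple automatically when full.
    windows: Dict[str, Deque[int]] = {
        "R": deque(maxlen=window_size),
        "S": deque(maxlen=window_size),
    }
    results: List[Tuple[int, int]] = []
    for stream, value in events:
        other = "S" if stream == "R" else "R"
        # Probing phase: full scan of the opposite window.
        for candidate in windows[other]:
            r, s = (value, candidate) if stream == "R" else (candidate, value)
            if predicate(r, s):
                results.append((r, s))
        # Insert the new tuple into its own window after probing.
        windows[stream].append(value)
    return results


events = [("R", 1), ("S", 1), ("S", 3), ("R", 3), ("R", 2)]
equi = window_join(events, 2, lambda r, s: r == s)      # equi-join
non_equi = window_join(events, 2, lambda r, s: r < s)   # non-equi-join
print(equi)
print(non_equi)
```

The same loop handles both predicates, which is why the nested-loop approach is general; the full scan during probing is the per-tuple cost that becomes expensive as windows grow.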
