SWOOP: Top-k Similarity Joins over Set Streams

We provide efficient support for applications that continuously search for pairs of similar sets in rapid streams of sets. A prototypical example is Twitter: a tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-$k$ most similar pairs of tweets from two rapid Twitter streams, e.g., to discover similar trends in two cities when each stream covers one city. In a sliding window model, the top-$k$ result changes as new sets in the stream enter the window or existing ones leave it. Maintaining the top-$k$ result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-$k$ result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-$k$ result are invalidated and must be replaced. It is therefore not enough to maintain only the $k$ most similar pairs, as less similar pairs may eventually be promoted to the top-$k$ result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-$k$ result is memory-intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs from which the top-$k$ result can be updated at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing approaches.
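To make the problem statement concrete, the following is a minimal sketch of the straightforward baseline described above, not of SWOOP itself. It pairs every new set with all sets in the other stream's window and keeps every pair. Several details are assumptions made only for illustration: Jaccard similarity as the similarity measure, a count-based window per stream, the stream labels "R" and "S", and the names `jaccard` and `NaiveWindowJoin`.

```python
import heapq
from collections import deque
from itertools import count


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveWindowJoin:
    """Naive top-k join of two set streams (hypothetical baseline, not SWOOP).

    Assumes a count-based sliding window of `window_size` sets per stream.
    """

    def __init__(self, k, window_size):
        self.k = k
        self.window_size = window_size
        # One FIFO window per stream, holding (set_id, set) entries.
        self.windows = {"R": deque(), "S": deque()}
        # Every valid cross-stream pair: (r_id, s_id) -> similarity.
        self.pairs = {}
        self._ids = count()

    def insert(self, stream, items):
        """Add a set to `stream` ("R" or "S") and return its id."""
        other = "S" if stream == "R" else "R"
        new_id = next(self._ids)
        new_set = frozenset(items)
        window = self.windows[stream]

        # Expiry: the oldest set leaves and all of its pairs become invalid.
        if len(window) == self.window_size:
            old_id, _ = window.popleft()
            self.pairs = {
                key: sim for key, sim in self.pairs.items() if old_id not in key
            }

        # Arrival: pair the new set with every set in the other window.
        for other_id, other_set in self.windows[other]:
            key = (new_id, other_id) if stream == "R" else (other_id, new_id)
            self.pairs[key] = jaccard(new_set, other_set)

        window.append((new_id, new_set))
        return new_id

    def top_k(self):
        """Return the k most similar valid pairs as ((r_id, s_id), sim)."""
        return heapq.nlargest(self.k, self.pairs.items(), key=lambda kv: kv[1])
```

In this sketch the number of stored pairs grows with the product of the two window sizes, and each arrival is compared against a full window, which is the memory and runtime cost the abstract attributes to the straightforward solution. Keeping only the current top-$k$ pairs would not be sufficient either: once a set expires, a pair that previously ranked lower may have to be promoted.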
