Top-k Spatio-Textual Similarity Join

With the development of location-based services (LBS), users are generating ever more spatio-textual data, e.g., check-ins and attraction reviews. Because the same spatio-textual entity may be represented differently, for example due to GPS deviations or typographical errors, effective methods are needed to integrate spatio-textual data from different sources. In this paper, we study the problem of top-$k$ spatio-textual similarity join (Topk-STJoin), which identifies the $k$ most similar pairs from two spatio-textual data sets. A major challenge in Topk-STJoin is to identify the top-$k$ similar pairs efficiently while considering both textual relevancy and spatial proximity. Traditional join algorithms that consider only one dimension (textual or spatial) are inefficient because they cannot exploit the pruning power of the other dimension. To address this challenge, we propose a signature-based top-$k$ join framework. We first generate a spatio-textual signature set for each object such that if two objects form one of the top-$k$ similar pairs, their signature sets must overlap. With this property, we can prune the large number of dissimilar pairs that share no common signature. We find that the order in which signatures are accessed has a significant effect on performance. We therefore compute an upper bound for each signature and propose a best-first accessing method that accesses signatures with large upper bounds first, so that pairs with small upper bounds can be pruned. We prove the optimality of our best-first accessing method. Next, we optimize the spatio-textual signatures and propose progressive signatures to further improve the pruning power. Experimental results on real-world datasets show that our algorithm achieves high performance and good scalability, and significantly outperforms baseline approaches.
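To make the problem statement concrete, the following is a minimal brute-force sketch of a top-$k$ spatio-textual join in Python. It is not the signature-based framework proposed in the paper. The similarity function (a weighted sum of Jaccard similarity on token sets and a linearly normalized Euclidean proximity), the parameters `alpha` and `max_dist`, and all function names are assumptions chosen for illustration only; the abstract does not specify them.

```python
import heapq
import math
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed textual measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def spatial_sim(p, q, max_dist):
    """Proximity in [0, 1]: 1 at distance 0, 0 at max_dist or beyond (assumed)."""
    return max(0.0, 1.0 - math.dist(p, q) / max_dist)


def st_sim(x, y, alpha=0.5, max_dist=10.0):
    """Hypothetical combined score: alpha * textual + (1 - alpha) * spatial."""
    (loc_x, tok_x), (loc_y, tok_y) = x, y
    return (alpha * jaccard(tok_x, tok_y)
            + (1 - alpha) * spatial_sim(loc_x, loc_y, max_dist))


def topk_stjoin_bruteforce(R, S, k, alpha=0.5, max_dist=10.0):
    """Score every pair in R x S and keep the k best in a min-heap.

    Returns a list of (score, index_in_R, index_in_S), best first.
    """
    heap = []  # min-heap; heap[0] is the weakest pair currently kept
    for (i, x), (j, y) in product(enumerate(R), enumerate(S)):
        score = st_sim(x, y, alpha, max_dist)
        if len(heap) < k:
            heapq.heappush(heap, (score, i, j))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i, j))
    return sorted(heap, reverse=True)


# Toy data: each object is ((x, y), set_of_tokens).
R = [((0.0, 0.0), {"blue", "bottle", "coffee"}),
     ((5.0, 5.0), {"city", "museum"})]
S = [((0.0, 1.0), {"blue", "bottle", "cafe"}),
     ((5.0, 5.0), {"city", "museum"}),
     ((9.0, 9.0), {"airport"})]

result = topk_stjoin_bruteforce(R, S, k=2)
for score, i, j in result:
    print(f"R[{i}] ~ S[{j}]: {score:.3f}")
```

This baseline evaluates all |R| × |S| pairs, which is the cost that signature-based pruning and best-first access are intended to avoid. The score of the weakest pair in the heap plays the role of the current top-$k$ threshold against which an upper bound could be compared.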
