Top-k Spatio-Textual Similarity Join

With the development of location-based services (LBS), LBS users are generating more and more spatio-textual data, e.g., check-ins and attraction reviews. Since a spatio-textual entity may have different representations, possibly due to GPS deviations or typographical errors, effective methods are needed to integrate spatio-textual data from different data sources. In this paper, we study the problem of top-$k$ spatio-textual similarity join (Topk-STJoin), which identifies the $k$ most similar pairs from two spatio-textual datasets. One big challenge in Topk-STJoin is to efficiently identify the top-$k$ similar pairs by considering both textual relevancy and spatial proximity. Traditional join algorithms that consider only one dimension (textual or spatial) are inefficient because they cannot exploit the pruning power of the other dimension. To address this challenge, we propose a signature-based top-$k$ join framework. We first generate a spatio-textual signature set for each object such that if two objects are in the top-$k$ similar pairs, their signature sets must overlap. With this property, we can prune large numbers of dissimilar pairs that share no common signature.
We find that the order in which signatures are accessed has a significant effect on performance. We therefore compute an upper bound for each signature and propose a best-first accessing method that preferentially accesses signatures with large upper bounds, so that pairs with small upper bounds can be pruned. We prove the optimality of our best-first accessing method. Next, we optimize the spatio-textual signatures and propose progressive signatures to further improve the pruning power. Experimental results on real-world datasets show that our algorithm achieves high performance and good scalability, and significantly outperforms baseline approaches.
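
The best-first idea can be illustrated with a minimal Python sketch. It does not reproduce the signatures or bounds of this paper: as a stand-in, it assumes a score of the form alpha * Jaccard + (1 - alpha) * (1 - dist / max_dist), uses grid cells as purely spatial "signatures", and bounds each cell pair by assuming a perfect textual match. All names (`similarity`, `best_first_topk`, `cell`, `max_dist`) are hypothetical, and `max_dist` is assumed to be at least the largest distance between any two objects.

```python
import heapq
import math
from collections import defaultdict
from itertools import product

def similarity(a, b, alpha, max_dist):
    """Assumed score: alpha * Jaccard(tokens) + (1 - alpha) * spatial closeness."""
    (xa, ya, ta), (xb, yb, tb) = a, b
    union = len(ta | tb)
    textual = len(ta & tb) / union if union else 0.0
    spatial = max(0.0, 1.0 - math.hypot(xa - xb, ya - yb) / max_dist)
    return alpha * textual + (1 - alpha) * spatial

def brute_force_topk(R, S, k, alpha, max_dist):
    """Reference answer: score every pair and keep the k largest."""
    scored = [(similarity(r, s, alpha, max_dist), i, j)
              for i, r in enumerate(R) for j, s in enumerate(S)]
    return heapq.nlargest(k, scored)

def cell_of(obj, cell):
    """Grid cell containing an object (a stand-in for a spatial signature)."""
    return (math.floor(obj[0] / cell), math.floor(obj[1] / cell))

def best_first_topk(R, S, k, alpha, max_dist, cell):
    """Visit cell pairs in decreasing upper-bound order and stop early."""
    grid_r, grid_s = defaultdict(list), defaultdict(list)
    for i, r in enumerate(R):
        grid_r[cell_of(r, cell)].append(i)
    for j, s in enumerate(S):
        grid_s[cell_of(s, cell)].append(j)

    def upper_bound(cr, cs):
        # Smallest possible distance between two cells; textual part assumed 1.
        dx = max(0, abs(cr[0] - cs[0]) - 1) * cell
        dy = max(0, abs(cr[1] - cs[1]) - 1) * cell
        closeness = max(0.0, 1.0 - math.hypot(dx, dy) / max_dist)
        return alpha * 1.0 + (1 - alpha) * closeness

    cell_pairs = sorted(((upper_bound(cr, cs), cr, cs)
                         for cr, cs in product(grid_r, grid_s)), reverse=True)
    heap = []  # min-heap holding the current k best (score, i, j) tuples
    for ub, cr, cs in cell_pairs:
        if len(heap) == k and ub < heap[0][0]:
            break  # every remaining cell pair is bounded below the k-th score
        for i in grid_r[cr]:
            for j in grid_s[cs]:
                item = (similarity(R[i], S[j], alpha, max_dist), i, j)
                if len(heap) < k:
                    heapq.heappush(heap, item)
                elif item > heap[0]:
                    heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)

# Toy data: (x, y, token set). Values are made up for illustration.
R = [(0.0, 0.0, {"coffee", "shop"}), (5.0, 5.0, {"book", "store"}),
     (9.0, 1.0, {"pizza"})]
S = [(0.5, 0.0, {"coffee", "shop"}), (5.0, 6.0, {"book", "shop"}),
     (9.0, 9.0, {"pizza"}), (0.0, 9.0, {"tea"})]
result = best_first_topk(R, S, k=2, alpha=0.5, max_dist=15.0, cell=2.0)
print(result)
```

Because each cell-pair bound never underestimates the score of any pair it covers, stopping once the next bound falls strictly below the current $k$-th score returns the same pairs as the brute-force join. A bound that also accounts for textual overlap, as spatio-textual signatures do, would allow earlier termination.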
