Efficient Spatio-textual Similarity Join Using MapReduce

Spatio-textual similarity join is a fundamental operation in many applications: it finds all pairs of objects that have similar textual descriptions and are spatially close to each other. With the growing popularity of GPS and its applications, the volume of spatio-textual data is increasing explosively, and existing methods cannot perform spatio-textual similarity joins efficiently on massive data. In this paper, we propose several approaches for spatio-textual similarity join using MapReduce. We use prefix filtering and grid partitioning to filter spatio-textual objects under the filter-and-refine framework. In addition, we propose two kinds of optimization methods to improve the efficiency of the basic spatio-textual similarity join method. Finally, we conduct extensive experiments on several synthetic datasets built from real datasets, and the results show that our approaches perform well in both efficiency and scalability.
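To make the filter-and-refine idea concrete, the following is a minimal single-process sketch of how grid partitioning and prefix filtering could be combined in a map step and a reduce step. It is not the algorithm from the paper, whose details the abstract does not give. Several choices here are assumptions made for illustration only: Jaccard similarity as the textual measure, Euclidean distance as the spatial measure, alphabetical order as the global token order, a grid cell width equal to the distance threshold, and the hypothetical names `prefix`, `map_phase`, `reduce_phase`, and `st_join`.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed textual measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix(tokens, tau):
    """Prefix of the token set under a global order (alphabetical here).

    For a Jaccard threshold tau, two sets with similarity >= tau must
    share at least one token among their first |x| - ceil(tau*|x|) + 1 tokens.
    """
    toks = sorted(set(tokens))
    keep = len(toks) - math.ceil(tau * len(toks) - 1e-9) + 1
    return toks[:keep]


def map_phase(objects, eps, tau):
    """Emit each object under keys (cell_x, cell_y, prefix_token).

    Cells have width eps, so any two objects within distance eps lie in
    the same or adjacent cells; each object is replicated to its own
    cell and the 8 neighbouring cells.
    """
    groups = defaultdict(list)
    for obj in objects:
        _, x, y, toks = obj
        cx, cy = math.floor(x / eps), math.floor(y / eps)
        for tok in prefix(toks, tau):
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    groups[(cx + dx, cy + dy, tok)].append(obj)
    return groups


def reduce_phase(groups, eps, tau):
    """Refine: verify candidate pairs that share a key against both thresholds."""
    result = set()
    for members in groups.values():
        for a, b in combinations(members, 2):
            pair = (min(a[0], b[0]), max(a[0], b[0]))
            if pair[0] == pair[1] or pair in result:
                continue  # same object seen twice, or pair already verified
            close = math.hypot(a[1] - b[1], a[2] - b[2]) <= eps
            if close and jaccard(a[3], b[3]) >= tau:
                result.add(pair)
    return sorted(result)


def st_join(objects, eps, tau):
    """Self-join over objects given as (id, x, y, tokens) tuples."""
    return reduce_phase(map_phase(objects, eps, tau), eps, tau)


if __name__ == "__main__":
    data = [
        (1, 0.0, 0.0, ["coffee", "shop", "wifi"]),
        (2, 0.5, 0.5, ["coffee", "shop", "bar"]),
        (3, 10.0, 10.0, ["coffee", "shop", "wifi"]),
        (4, 0.2, 0.1, ["pizza", "pasta"]),
    ]
    print(st_join(data, eps=1.0, tau=0.5))
```

In this sketch, replicating every object to nine cells is the simplest way to avoid missing pairs that straddle a cell boundary, at the cost of duplicate candidates that the reduce step must skip. A real MapReduce job would distribute the groups across reducers and would likely use a frequency-based token order so that rare tokens appear in the prefixes.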
