Efficient large-scale distance-based join queries in SpatialHadoop

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is important in many application domains. The most representative and best-known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). Over two spatial datasets, the KCPQ returns the K pairs with the smallest distances between their components, while the εDJQ returns every pair whose components lie within a distance threshold ε of each other. Both are expensive operations, since two spatial datasets are combined under additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored on distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel plane-sweep-based algorithms that perform parallel DBJQs efficiently on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms under several settings, using large real-world and synthetic datasets. The experiments demonstrate the efficiency and scalability of the proposed methods.
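To make the two query types concrete, the following is a minimal single-machine sketch in Python. It is not the paper's SpatialHadoop algorithm: the function names `kcpq` and `edjq_plane_sweep`, the point representation as `(x, y)` tuples, and the use of Euclidean distance are assumptions made here for illustration. The KCPQ is shown as a brute-force reference, and the εDJQ uses a simple plane sweep along the x-axis so that only points inside an ε-wide window are compared.

```python
import heapq
from math import hypot


def kcpq(P, Q, k):
    """Return the k closest (distance, p, q) pairs between point sets P and Q.

    Brute-force reference: every pair is examined once.
    """
    pairs = ((hypot(p[0] - q[0], p[1] - q[1]), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs)


def edjq_plane_sweep(P, Q, eps):
    """Return all (distance, p, q) pairs with distance <= eps.

    Both sets are sorted by x; for each p, only the points of Q whose
    x-coordinate lies in [p.x - eps, p.x + eps] are compared.
    """
    P, Q = sorted(P), sorted(Q)
    result = []
    start = 0  # index of the first point of Q that can still be within eps in x
    for p in P:
        # Points left of the window can never match this or any later p.
        while start < len(Q) and Q[start][0] < p[0] - eps:
            start += 1
        j = start
        while j < len(Q) and Q[j][0] <= p[0] + eps:
            q = Q[j]
            d = hypot(p[0] - q[0], p[1] - q[1])
            if d <= eps:
                result.append((d, p, q))
            j += 1
    return sorted(result)


if __name__ == "__main__":
    P = [(0, 0), (5, 5)]
    Q = [(1, 0), (5, 7), (10, 10)]
    print(kcpq(P, Q, 2))
    print(edjq_plane_sweep(P, Q, 2.0))
```

In a distributed setting, a sweep of this kind would run per partition, with a separate step to combine partial results; that step is outside the scope of this sketch.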
