What Makes Spatial Data Big? A Discussion on How to Partition Spatial Data

The amount of available spatial data has increased significantly in recent years, to the point that traditional analysis tools can no longer manage it effectively. Many attempts have therefore been made to extend existing MapReduce tools, such as Hadoop or Spark, with spatial capabilities in terms of data types and algorithms. Such extensions are mainly based on the partitioning techniques implemented for textual data, where size is measured as the number of occupied bytes. However, spatial data are characterized by other features that describe their size, such as the number of vertices or the MBR (minimum bounding rectangle) size of geometries, which greatly affect the performance of operations such as the spatial join during data analysis. As a result, traditional partitioning techniques prevent the parallel execution provided by a MapReduce environment from being fully exploited. This paper extensively analyses the problem using the spatial join operation as a use case, performing both a theoretical and an experimental analysis of it. Moreover, it provides a solution based on a different partitioning technique, which splits complex or extensive geometries. Finally, we validate the proposed solution by means of experiments on synthetic and real datasets.

2012 ACM Subject Classification: Information systems → Geographic information systems
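To make the role of geometry extent concrete, the following minimal sketch shows how an extensive geometry behaves under a uniform grid partitioning. The uniform grid, the cell size, and the function names `overlapping_cells` and `clip_to_cell` are assumptions introduced purely for illustration; they are not taken from the paper, and clipping an MBR to a cell is only a stand-in for splitting an actual geometry.

```python
from typing import List, Tuple

# An MBR is (min_x, min_y, max_x, max_y); a cell is (column, row).
MBR = Tuple[float, float, float, float]
Cell = Tuple[int, int]


def overlapping_cells(mbr: MBR, cell_size: float) -> List[Cell]:
    """Return every cell of a uniform grid that the MBR overlaps.

    Without splitting, a geometry is replicated to all of these cells,
    so an extensive geometry adds work to many partitions at once.
    """
    min_x, min_y, max_x, max_y = mbr
    cols = range(int(min_x // cell_size), int(max_x // cell_size) + 1)
    rows = range(int(min_y // cell_size), int(max_y // cell_size) + 1)
    return [(c, r) for c in cols for r in rows]


def clip_to_cell(mbr: MBR, cell: Cell, cell_size: float) -> MBR:
    """Clip the MBR to one cell (a simplified stand-in for geometry splitting)."""
    min_x, min_y, max_x, max_y = mbr
    col, row = cell
    return (
        max(min_x, col * cell_size),
        max(min_y, row * cell_size),
        min(max_x, (col + 1) * cell_size),
        min(max_y, (row + 1) * cell_size),
    )


# Hypothetical data: one compact and one extensive geometry, given by their MBRs.
small = (1.0, 1.0, 2.0, 2.0)
large = (5.0, 5.0, 35.0, 25.0)

small_cells = overlapping_cells(small, cell_size=10.0)
large_cells = overlapping_cells(large, cell_size=10.0)
pieces = [clip_to_cell(large, cell, 10.0) for cell in large_cells]
```

In this sketch the compact geometry falls into a single cell, while the extensive one overlaps twelve. Replicating it whole would charge its full cost to each of those cells, whereas the clipped pieces give each cell only the portion that lies within it.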
