Spatial coding-based approach for partitioning big spatial data in Hadoop

Abstract Spatial data partitioning (SDP) plays an important role in distributed storage and parallel computing for spatial data. However, because spatial data are unevenly distributed and spatial vector objects vary in volume, it is a significant challenge to ensure both optimal performance of spatial operations and data balance in the cluster. To tackle this problem, we propose a spatial coding-based approach for partitioning big spatial data in Hadoop. The approach first compresses the whole spatial dataset, using a spatial coding matrix, into a sensing information set (SIS) that records the spatial code, size, count and other information. The SIS is then used to build a spatial partitioning matrix, which is finally used to split all spatial objects into different partitions in the cluster. With this approach, neighbouring spatial objects can be placed in the same block, while data skew in the Hadoop distributed file system (HDFS) is minimized. In a case study, the approach is compared against random sampling-based partitioning on three measures: spatial index quality, data skew in HDFS, and range query performance. The experimental results show that the spatial coding-based method improves both the query performance of big spatial data and the data balance in HDFS. We implemented and deployed the approach in Hadoop, and it can also be applied efficiently to other distributed big spatial data systems.
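The pipeline described in the abstract (spatial codes, then an SIS summary, then a partitioning map, then the split) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Z-order (Morton) code as the spatial code, point objects given as `(x, y, size)` tuples, and a simple greedy packing of code-ordered cells into blocks. All function names (`morton_code`, `cell_code`, `build_sis`, `build_partition_map`, `partition`) and parameters (`extent`, `grid`, `block_size`) are hypothetical.

```python
from collections import defaultdict


def morton_code(col, row, bits=16):
    """Interleave the bits of col and row into a Z-order code (assumed coding)."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def cell_code(x, y, extent, grid):
    """Map a point to the spatial code of the grid cell that contains it."""
    min_x, min_y, max_x, max_y = extent
    col = min(int((x - min_x) / (max_x - min_x) * grid), grid - 1)
    row = min(int((y - min_y) / (max_y - min_y) * grid), grid - 1)
    return morton_code(col, row)


def build_sis(objects, extent, grid):
    """Summarise the data per cell: code -> [object count, total size]."""
    sis = defaultdict(lambda: [0, 0])
    for x, y, size in objects:
        entry = sis[cell_code(x, y, extent, grid)]
        entry[0] += 1
        entry[1] += size
    return sis


def build_partition_map(sis, block_size):
    """Greedily pack cells, in code order, into partitions of about block_size."""
    mapping = {}
    part, used = 0, 0
    for code in sorted(sis):
        size = sis[code][1]
        # Start a new partition when adding this cell would overflow the block.
        if used > 0 and used + size > block_size:
            part += 1
            used = 0
        mapping[code] = part
        used += size
    return mapping


def partition(objects, extent, grid, block_size):
    """Assign every object to the partition of its cell."""
    sis = build_sis(objects, extent, grid)
    mapping = build_partition_map(sis, block_size)
    parts = defaultdict(list)
    for obj in objects:
        parts[mapping[cell_code(obj[0], obj[1], extent, grid)]].append(obj)
    return dict(parts)


objects = [(0.1, 0.1, 40), (0.2, 0.1, 40), (0.9, 0.9, 40), (0.8, 0.9, 40)]
parts = partition(objects, extent=(0, 0, 1, 1), grid=2, block_size=100)
```

Because cells are packed in code order, objects that are close in space tend to land in the same partition, while the `block_size` limit keeps partition sizes similar. Only the small summary, not the full dataset, is needed to build the partition map.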
