Large Scale Analytics of Vector+Raster Big Spatial Data

Significant increases in the volume of big spatial data have driven researchers and practitioners to build specialized systems to process and analyze this data. Existing efforts focus on either big raster data, e.g., remote sensing data or medical images, or big vector data, e.g., geotagged tweets or trajectories. However, when raster and vector data are combined, one dataset must first be converted to the other's representation through a vector-to-raster or raster-to-vector transformation, which is extremely inefficient for large datasets. In this paper, we advocate a third approach that mixes the raw representations of both vector and raster data in the query processor. As a case study, we apply this approach to the zonal statistics problem, which computes statistics over a raster layer for each polygon in a vector layer. We propose a novel method, called the Scanline method, which does not require conversion between raster and vector representations. Experimental evaluation on real datasets as large as 840 billion pixels shows up to three orders of magnitude speedup over the baseline methods.
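To make the zonal statistics problem concrete, the following is a minimal single-machine sketch of the general scanline idea: for each raster row, intersect a horizontal line through the pixel centers with the polygon edges, then aggregate the pixels between each pair of crossings. It is an illustration only and not the paper's Scanline method, whose details are not given in this abstract. The function name `zonal_stats_scanline`, the pixel-coordinate convention (pixel in row r, column c has its center at (c + 0.5, r + 0.5)), and the half-open boundary rule are all assumptions made for this example.

```python
from math import ceil


def zonal_stats_scanline(raster, polygon):
    """Aggregate raster values whose pixel centers fall inside one polygon.

    raster:  list of equal-length rows of numbers (hypothetical in-memory layout).
    polygon: list of (x, y) vertices in pixel coordinates, implicitly closed.
    Neither input is converted to the other representation.
    """
    n_rows, n_cols = len(raster), len(raster[0])
    n = len(polygon)
    count, total = 0, 0.0
    lo, hi = None, None

    for r in range(n_rows):
        y = r + 0.5  # scanline through the pixel centers of row r
        crossings = []
        for i in range(n):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
            # An edge crosses the scanline when its endpoints lie on
            # opposite sides; horizontal edges never satisfy this test.
            if (y1 <= y) != (y2 <= y):
                crossings.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()

        # Consecutive crossing pairs delimit the interior runs of this row.
        for x_in, x_out in zip(crossings[0::2], crossings[1::2]):
            # Keep columns whose center cx satisfies x_in <= cx < x_out.
            c_start = max(0, ceil(x_in - 0.5))
            c_end = min(n_cols - 1, ceil(x_out - 0.5) - 1)
            for c in range(c_start, c_end + 1):
                v = raster[r][c]
                count += 1
                total += v
                lo = v if lo is None else min(lo, v)
                hi = v if hi is None else max(hi, v)

    mean = total / count if count else None
    return {"count": count, "sum": total, "min": lo, "max": hi, "mean": mean}


# Usage: a 4x4 raster and a square zone covering its four central pixels.
raster = [[r * 4 + c for c in range(4)] for r in range(4)]
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
stats = zonal_stats_scanline(raster, square)
```

Running the function once per polygon in the vector layer yields the per-zone statistics. The sketch reads raster values directly and only touches rows and columns that the polygon can overlap, which is the property that avoids a rasterization or vectorization pass; a large-scale version would additionally need to partition the raster and distribute the work.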
