A New Design of High-Performance Large-Scale GIS Computing at a Finer Spatial Granularity: A Case Study of Spatial Join with Spark for Sustainability

Sustainability research faces many challenges because environmental, urban, and regional contexts are changing rapidly at an unprecedentedly fine spatial granularity, which produces fast-growing data volumes and calls for faster detection of spatial relationships. Spatial join is a fundamental method for enriching data with spatial relations. The dramatic growth of data volumes has increased the focus on high-performance, large-scale spatial join. In this paper, we present Spatial Join with Spark (SJS), a high-performance algorithm that uses a simple but efficient uniform spatial grid to partition datasets and joins the partitions with Spark's built-in join transformation. SJS exploits Spark's distributed in-memory iterative computation and introduces a calculation-evaluating model and an in-memory spatial repartition technique, which together optimize the initial partition by estimating the computational cost of the local join algorithms without any disk access. We compare four in-memory spatial join algorithms in SJS for further performance improvement. Based on extensive experiments with real-world data, we conclude that SJS outperforms the Spark and MapReduce implementations of earlier spatial join approaches. This study demonstrates that high-performance computing is a promising basis for large-scale spatial join analysis. The availability of large geo-referenced datasets, together with high-performance computing technology, creates opportunities for sustainability research to examine whether and how these new trends in data and technology can help detect trends and patterns in human-environment dynamics.
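
The grid-partition-then-join idea described above can be illustrated with a minimal single-machine sketch in plain Python. It is not the SJS implementation: it assumes objects are axis-aligned bounding rectangles given as (xmin, ymin, xmax, ymax), uses a nested-loop local join, and replaces Spark's keyed join with a dictionary lookup on the cell key. The names `cells`, `intersects`, `grid_join`, and `cell_size` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from itertools import product


def cells(rect, cell_size):
    """Return every uniform-grid cell (col, row) that the rectangle overlaps."""
    xmin, ymin, xmax, ymax = rect
    c0, c1 = int(xmin // cell_size), int(xmax // cell_size)
    r0, r1 = int(ymin // cell_size), int(ymax // cell_size)
    return product(range(c0, c1 + 1), range(r0, r1 + 1))


def intersects(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def grid_join(left, right, cell_size):
    """Join two rectangle lists by grid cell, then refine within each cell."""
    # Partition phase: replicate each object into every cell it overlaps.
    buckets_left = defaultdict(list)
    buckets_right = defaultdict(list)
    for i, rect in enumerate(left):
        for cell in cells(rect, cell_size):
            buckets_left[cell].append(i)
    for j, rect in enumerate(right):
        for cell in cells(rect, cell_size):
            buckets_right[cell].append(j)

    # Join phase: only cells present on both sides can produce results.
    # A set removes duplicates caused by objects that span several cells.
    result = set()
    for cell in buckets_left.keys() & buckets_right.keys():
        for i in buckets_left[cell]:
            for j in buckets_right[cell]:
                if intersects(left[i], right[j]):
                    result.add((i, j))
    return sorted(result)


left = [(0, 0, 2, 2), (5, 5, 6, 6)]
right = [(1, 1, 3, 3), (7, 7, 8, 8), (5.5, 5.5, 9, 9)]
print(grid_join(left, right, cell_size=2))  # [(0, 0), (1, 2)]
```

In a distributed setting, the cell key would serve as the join key so that objects sharing a cell are brought to the same worker; the choice of `cell_size` then governs the trade-off between replication of large objects and the amount of work per cell.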
