Geospatial Data Management in Apache Spark: A Tutorial

The volume of spatial data increases at a staggering rate. This tutorial comprehensively studies how existing works extend Apache Spark to support massive-scale spatial data. During this 1.5-hour tutorial, we first provide background on the characteristics of spatial data and the history of distributed data management systems. A follow-up section presents the common approaches practitioners use to extend Spark and introduces the vital components of a generic spatial data management system. The third, fourth, and fifth sections then discuss ongoing efforts and experience in spatio-temporal data, spatial data analytics, and streaming spatial data, respectively. The sixth part concludes the tutorial, helping the audience better grasp the overall content, and points out future research directions.
