Big Geospatial Data Processing Made Easy: A Working Guide to GeoSpark

In the past decade, the volume of available geospatial data has increased tremendously. Such data includes, but is not limited to, weather maps, socio-economic data, and geo-tagged social media. Moreover, the unprecedented popularity of GPS-equipped mobile devices and Internet of Things (IoT) sensors has led to the continuous generation of large-scale location information combined with the status of the surrounding environment. For example, several cities have started installing sensors at road intersections to monitor the environment, traffic, and air quality. Making sense of the rich geospatial properties hidden in this data may greatly transform our society. Subjects undergoing intense study include: (1) climate analysis, which covers climate change analysis (N. R. C. Committee on the Science of Climate Change 2001), the study of deforestation (Zeng et al. 1996), population migration (Chen et al. 1999), and variation in sea levels (Woodworth et al. 2011); (2) urban planning, which assists governments in city and regional planning, road network design, and transportation and traffic engineering; and (3) commerce and advertisement (Dhar and Varshney 2011), e.g., point-of-interest (POI) recommendation services. These data-intensive spatial analytics applications rely heavily on the underlying database management systems (DBMSs) to efficiently manipulate, retrieve, and manage data.

[1] Michael J. Franklin, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, 2012, NSDI.

[2] Julia Dmitrieva, et al. Population Migration and the Variation of Dopamine D4 Receptor (DRD4) Allele Frequencies Around the Globe, 1999.

[3] Peter J. Bickel, et al. S: An Interactive Environment for Data Analysis and Graphics, 1984.

[4] P. Woodworth, et al. Erratum to: Evidence for Century-Timescale Acceleration in Mean Sea Levels and for Recent Changes in Extreme Sea Levels, 2011.

[5] Robert E. Dickinson, et al. Climatic impact of Amazon deforestation: a mechanistic model study, 1996.

[6] Minyi Guo, et al. Simba: Efficient In-Memory Spatial Analytics, 2016, SIGMOD Conference.

[7] Ahmed Eldawy, et al. SpatialHadoop: A MapReduce framework for spatial data, 2015, 2015 IEEE 31st International Conference on Data Engineering.

[8] Hairong Kuang, et al. The Hadoop Distributed File System, 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[9] Jia Yu, et al. Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for Modern Database Systems, 2016, Proc. VLDB Endow.

[10] Jia Yu, et al. Indexing the Pickup and Drop-Off Locations of NYC Taxi Trips in PostgreSQL - Lessons from the Road, 2017, SSTD.

[11] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.

[12] Howard Gobioff, et al. The Google file system, 2003, SOSP '03.

[13] Bernd-Uwe Pagel, et al. Towards an analysis of range query performance in spatial data structures, 1993, PODS '93.

[14] Jia Yu, et al. Spatial data management in Apache Spark: the GeoSpark perspective and beyond, 2018, GeoInformatica.

[15] Sanjay Ghemawat, et al. MapReduce: simplified data processing on large clusters, 2008, CACM.

[16] David J. DeWitt, et al. Partition based spatial-merge join, 1996, SIGMOD '96.

[17] Ross Ihaka, Robert Gentleman. R: A language for data analysis and graphics, 1996.

[18] Upkar Varshney, et al. Challenges and business models for mobile location-based services and advertising, 2011, Commun. ACM.

[19] Ahmed Eldawy, et al. Pigeon: A spatial MapReduce language, 2014, 2014 IEEE 30th International Conference on Data Engineering.