A Performance Study of Big Spatial Data Systems

With the accelerated growth in the volume of spatial data generated from a wide variety of sources, the need for efficient storage, retrieval, processing and analysis of spatial data is more important than ever. Hence, spatial data processing systems have become an important field of research. In recent years, a number of Big Spatial Data systems have been proposed by researchers around the world. These systems can be roughly categorized into Apache Hadoop-based systems and in-memory systems based on Apache Spark. The features supported by these systems vary widely. However, there has not been any comprehensive evaluation of these systems in terms of performance, scalability and functionality. To address this need, we propose a benchmark to evaluate Big Spatial Data systems. Although Spark is a very popular framework, its performance is limited by the overhead associated with distributed resource management and coordination. The Big Spatial Data systems that are based on Spark are constrained by the same overhead. We introduce SpatialIgnite, a Big Spatial Data system that we have developed based on Apache Ignite. We investigate the present status of Big Spatial Data systems by conducting a comprehensive feature analysis and performance evaluation of a few representative systems with our benchmark. Our study shows that SpatialIgnite performs better than the Hadoop-based and Spark-based systems that we have evaluated.
