The STARK Framework for Spatio-Temporal Data Analytics on Spark

Big Data sets can contain all types of information, from server log files to tracking data that records the location of mobile users at a point in time. Apache Spark has been widely accepted for Big Data analytics because of its very fast processing model. However, Spark has no native support for spatial or spatio-temporal data. Spatial filters or joins using, e.g., a contains predicate are not supported, so users would have to implement them themselves, typically inefficiently. Spark also cannot exploit, e.g., the spatial distribution of the data for optimal partitioning. Here we present our STARK framework, which adds spatio-temporal support to Spark. It includes spatial partitioners, different modes for indexing, as well as filter, join, and clustering operators. In contrast to existing solutions, STARK integrates seamlessly into any (Scala) Spark program and provides more flexible and comprehensive operators. Furthermore, our experimental evaluation shows that our implementation outperforms existing solutions.
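To illustrate why a hand-written contains filter tends to be inefficient and how spatial partitioning can help, the following is a minimal sketch in plain Python. It is not STARK's API and does not use Spark: the record layout, the fixed-size grid, and all function names are assumptions made for illustration only. The naive variant tests every record, while the partitioned variant scans only the grid cells that overlap the query rectangle.

```python
from collections import defaultdict

# Hypothetical sample data: (id, x, y, timestamp) tuples, not taken from the paper.
events = [
    ("a", 1.0, 1.0, 10),
    ("b", 2.5, 3.5, 20),
    ("c", 7.0, 8.0, 30),
    ("d", 9.5, 0.5, 40),
]

def contains(rect, x, y):
    """True if the axis-aligned rectangle (xmin, ymin, xmax, ymax) contains the point."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

def naive_filter(data, rect, t_from, t_to):
    """Full scan: test every record against the spatial and temporal predicate."""
    return [e for e in data if contains(rect, e[1], e[2]) and t_from <= e[3] <= t_to]

def grid_partition(data, cell_size):
    """Assign each record to a fixed-size grid cell keyed by (column, row)."""
    cells = defaultdict(list)
    for e in data:
        cells[(int(e[1] // cell_size), int(e[2] // cell_size))].append(e)
    return cells

def partitioned_filter(cells, cell_size, rect, t_from, t_to):
    """Scan only the grid cells whose extent overlaps the query rectangle."""
    xmin, ymin, xmax, ymax = rect
    result = []
    for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            # Records in a candidate cell still need the exact predicate check.
            result.extend(naive_filter(cells.get((cx, cy), []), rect, t_from, t_to))
    return result

query = (0.0, 0.0, 3.0, 4.0)            # spatial query rectangle
cells = grid_partition(events, cell_size=5.0)
full = naive_filter(events, query, 0, 25)
pruned = partitioned_filter(cells, 5.0, query, 0, 25)
```

Both variants return the same records; the partitioned one simply skips cells that cannot contain a match. A fixed grid is only one possible scheme and handles skewed data poorly, which is one reason a framework might offer several partitioners.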
