ST-Hadoop: a MapReduce framework for spatio-temporal data

This paper presents ST-Hadoop, the first full-fledged open-source MapReduce framework with native support for spatio-temporal data. ST-Hadoop is a comprehensive extension to Hadoop and SpatialHadoop that injects spatio-temporal data awareness into each of their layers, mainly the language, indexing, and operations layers. In the language layer, ST-Hadoop provides built-in spatio-temporal data types and operations. In the indexing layer, ST-Hadoop spatio-temporally loads and divides data across computation nodes in the Hadoop Distributed File System in a way that mimics spatio-temporal index structures, which results in orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and queries. In the operations layer, ST-Hadoop ships with support for three fundamental spatio-temporal queries, namely spatio-temporal range, top-k nearest neighbor, and join queries. The extensibility of ST-Hadoop allows others to add features and operations easily using approaches similar to those described in the paper. Extensive experiments conducted on a large-scale dataset of size 10 TB that contains over 1 billion spatio-temporal records show that ST-Hadoop achieves orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and operations. The key idea behind the performance gain in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
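To make the indexing idea concrete, the following is a minimal single-machine sketch of why dividing data by time and space helps a spatio-temporal range query. It is not ST-Hadoop's implementation: the daily temporal slice, the uniform grid, and every name in it (`Record`, `SLICE_SECONDS`, `CELL_SIZE`, `partition_key`, `build_index`, `range_query`) are assumptions made for illustration only. Each record is assigned to a partition keyed by (time slice, grid column, grid row), and a range query reads only the partitions that overlap the query window instead of scanning all data.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    x: float  # spatial coordinate
    y: float  # spatial coordinate
    t: int    # timestamp in seconds


# Hypothetical partitioning parameters, chosen only for this sketch.
SLICE_SECONDS = 86_400  # one temporal slice per day
CELL_SIZE = 10.0        # width and height of a uniform grid cell


def partition_key(r: Record) -> tuple:
    """Map a record to its (time slice, grid column, grid row) partition."""
    return (r.t // SLICE_SECONDS, int(r.x // CELL_SIZE), int(r.y // CELL_SIZE))


def build_index(records):
    """Group records into partitions, first by time slice, then by grid cell."""
    index = defaultdict(list)
    for r in records:
        index[partition_key(r)].append(r)
    return index


def range_query(index, x_min, x_max, y_min, y_max, t_min, t_max):
    """Return matching records and the number of partitions actually read."""
    hits, scanned = [], 0
    for s in range(t_min // SLICE_SECONDS, t_max // SLICE_SECONDS + 1):
        for cx in range(int(x_min // CELL_SIZE), int(x_max // CELL_SIZE) + 1):
            for cy in range(int(y_min // CELL_SIZE), int(y_max // CELL_SIZE) + 1):
                part = index.get((s, cx, cy))
                if part is None:
                    continue  # no data stored for this partition
                scanned += 1
                # Refine: partitions are coarse, so check each record exactly.
                hits.extend(
                    r for r in part
                    if x_min <= r.x <= x_max
                    and y_min <= r.y <= y_max
                    and t_min <= r.t <= t_max
                )
    return hits, scanned


records = [
    Record(1.0, 1.0, 0),
    Record(15.0, 1.0, 0),
    Record(1.0, 1.0, 90_000),
    Record(55.0, 55.0, 200_000),
]
index = build_index(records)
hits, scanned = range_query(index, 0.0, 9.0, 0.0, 9.0, 0, 86_399)
```

In this toy run the four records fall into four partitions, and the query touches only one of them. In a distributed setting the same pruning step would decide which file blocks a job has to read, which is the kind of saving the abstract attributes to indexing inside the file system.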
