High-Performance Geospatial Big Data Processing System Based on MapReduce

With the rapid development of Internet of Things (IoT) technologies, the growing volume and diversity of geospatial big data sources have created challenges in storing, managing, and processing data. Beyond the general characteristics of big data, the unique properties of spatial data make geospatial big data even more complicated to handle. To help users implement geospatial big data applications in a MapReduce framework, several big data processing systems have extended the original Hadoop to support spatial properties. Most of those platforms, however, add spatial functionality in the form of a plug-in. Although a plug-in offers a convenient way to add new features to an existing system, the approach has several limitations. In particular, when a workflow alternates between the existing system and the plug-in to execute spatial and nonspatial operations, it incurs additional read and write overhead, which significantly reduces performance. To address this issue, we have developed Marmot, a high-performance geospatial big data processing system based on MapReduce. Marmot extends Hadoop at a low level to support seamless integration of spatial and nonspatial operations within a single framework, improving the performance of geoprocessing workflows. This paper explains the overall architecture and data model of Marmot, as well as the main algorithm for automatically constructing MapReduce jobs from a given spatial analysis task. To illustrate how Marmot transforms a sequence of spatial analysis operators into map and reduce functions in a way that achieves better performance, this paper presents an example spatial analysis that retrieves the number of subway stations per city in Korea. This paper also demonstrates experimentally that Marmot generally outperforms SpatialHadoop, one of the leading plug-in-based spatial big data frameworks, particularly on complex and time-intensive queries involving a spatial index.
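
The subway-station example can be pictured with a small, self-contained sketch. The code below is not Marmot's API or its job-construction algorithm; it is an illustrative in-memory approximation, assuming hypothetical toy data in which each city is an axis-aligned bounding box rather than a real polygon. It shows the general idea of fusing a spatial operation (point-in-region test) and a nonspatial one (counting) into a single map and reduce pass, so that no intermediate result has to be written out between the two.

```python
from collections import defaultdict

# Hypothetical toy data (not from the paper): each city is simplified to an
# axis-aligned bounding box (min_x, min_y, max_x, max_y).
CITIES = {
    "A": (0.0, 0.0, 10.0, 10.0),
    "B": (10.0, 0.0, 20.0, 10.0),
}

# Hypothetical stations as (name, x, y).
STATIONS = [
    ("s1", 1.0, 1.0),
    ("s2", 5.0, 9.0),
    ("s3", 12.0, 3.0),
    ("s4", 25.0, 5.0),  # lies outside every city
]


def map_station(station):
    """Map step: spatial test fused with key emission.

    Emits (city, 1) for the city whose box contains the station.
    Half-open intervals keep a point on a shared edge from being counted twice.
    """
    _, x, y = station
    for city, (min_x, min_y, max_x, max_y) in CITIES.items():
        if min_x <= x < max_x and min_y <= y < max_y:
            yield city, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per city key."""
    counts = defaultdict(int)
    for city, n in pairs:
        counts[city] += n
    return dict(counts)


def run_job(stations):
    """Chain map and reduce in one pass, with no intermediate materialization."""
    pairs = (pair for s in stations for pair in map_station(s))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(run_job(STATIONS))
```

A real system would replace the linear scan over `CITIES` with a spatial index lookup and the bounding-box test with an exact geometry predicate; the sketch only shows how the join and the aggregation can share one job.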