Automatic Scaling Hadoop in the Cloud for Efficient Process of Big Geospatial Data

Efficient processing of big geospatial data is crucial for tackling global and regional challenges such as climate change and natural disasters, but it is difficult not only because of the massive data volume but also because of the intrinsic complexity and high dimensionality of geospatial datasets. Because traditional computing infrastructure does not scale well with rapidly increasing data volumes, Hadoop has attracted growing attention in the geoscience community for handling big geospatial data. Many recent studies have investigated adopting Hadoop for processing big geospatial data, but how to adjust computing resources to efficiently handle a dynamic geoprocessing workload has barely been explored. To bridge this gap, we propose a novel framework that automatically scales a Hadoop cluster in the cloud, allocating the right amount of computing resources for the dynamic geoprocessing workload. We introduce the framework and its auto-scaling algorithms, and we developed a prototype system to demonstrate the feasibility and efficiency of the proposed scaling mechanism, using Digital Elevation Model (DEM) interpolation as an example. Experimental results show that the auto-scaling framework can (1) significantly reduce computing resource utilization (by 80% in our example) while delivering performance similar to that of a full-powered cluster; and (2) effectively handle spikes in processing workload by automatically increasing computing resources so that processing finishes within an acceptable time. Such an auto-scaling approach provides a valuable reference for optimizing the performance of geospatial applications and addressing data- and computation-intensity challenges in GIScience in a more cost-efficient manner.
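The abstract does not spell out the auto-scaling algorithms, so the following is only a minimal illustrative sketch of the general idea of sizing a cluster to its workload, not the method used in this work. It assumes a simple rule in which the desired number of worker nodes is proportional to the number of pending tasks, bounded by a minimum and maximum cluster size. All names and default values (`target_node_count`, `scaling_action`, `tasks_per_node`, `min_nodes`, `max_nodes`) are hypothetical.

```python
import math


def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Return the worker-node count that would cover the pending workload.

    Hypothetical rule: one node per `tasks_per_node` pending tasks,
    clamped to the range [min_nodes, max_nodes].
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, pending_tasks,
                   tasks_per_node=4, min_nodes=1, max_nodes=10):
    """Decide whether to add nodes, remove nodes, or hold the cluster size.

    Returns a tuple (action, count), where action is "add", "remove",
    or "hold", and count is the number of nodes affected.
    """
    target = target_node_count(pending_tasks, tasks_per_node,
                               min_nodes, max_nodes)
    delta = target - current_nodes
    if delta > 0:
        return ("add", delta)
    if delta < 0:
        return ("remove", -delta)
    return ("hold", 0)
```

Under these assumed defaults, a workload spike on a small cluster yields an "add" decision, while an idle full-size cluster yields a "remove" decision that shrinks it to the minimum size. A real system would also need to account for node start-up time and for safely decommissioning nodes that hold data.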
