A high performance query analytical framework for supporting data-intensive climate studies

Abstract: Climate observations and model simulations produce vast amounts of data. The unprecedented data volume and the complexity of geospatial statistics and analysis require efficient methods for analyzing big climate data in order to investigate global problems such as climate change, natural disasters, diseases, and other environmental issues. This paper introduces a high-performance query analytical framework that tackles these challenges by leveraging Hive and cloud computing technologies. With this framework, we propose grid transformation, a new perspective on complex climate analysis that applies a series of atomic transformations to terabytes of climate data using SQL-style queries (HiveQL). Specifically, we introduce four types of grid transformation (temporal, spatial, local, and arithmetic) to support a broad range of climate analyses, from basic spatiotemporal aggregation to more sophisticated anomaly detection. Each query is processed as MapReduce tasks on a highly scalable Hadoop cluster, which serves as the parallel processing engine. Big climate data are stored and managed directly in the Hadoop Distributed File System without any data format conversion. A prototype is developed to evaluate the feasibility and performance of the framework. Experimental results show that complex and data-intensive climate analysis can be conducted using intuitive SQL queries with good flexibility and performance. This research provides a building block and practical insights for establishing a cyberinfrastructure that offers a high-performance, collaborative environment for data-intensive geospatial applications in climate science.
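
To make the idea of composing atomic grid transformations concrete, the following is a minimal single-machine sketch in plain Python. It is not the paper's HiveQL or MapReduce implementation. The abstract names the four transformation types without defining them, so the reading of each type below is an assumption, and the record layout and all function names (`temporal_mean`, `spatial_mean`, `local_map`, `arithmetic_subtract`, `anomalies`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (month, lat, lon, value), e.g. temperature in kelvin.
RECORDS = [
    (1, 0, 0, 280.0), (1, 0, 1, 282.0),
    (2, 0, 0, 284.0), (2, 0, 1, 286.0),
]

def temporal_mean(records):
    """Assumed temporal transformation: collapse time, one mean per grid cell."""
    cells = defaultdict(list)
    for _month, lat, lon, value in records:
        cells[(lat, lon)].append(value)
    return {cell: mean(values) for cell, values in cells.items()}

def spatial_mean(records):
    """Assumed spatial transformation: collapse space, one mean per time step."""
    steps = defaultdict(list)
    for month, _lat, _lon, value in records:
        steps[month].append(value)
    return {month: mean(values) for month, values in steps.items()}

def local_map(records, fn):
    """Assumed local transformation: apply fn to every cell value independently."""
    return [(m, lat, lon, fn(v)) for m, lat, lon, v in records]

def arithmetic_subtract(records, baseline):
    """Assumed arithmetic transformation: cell-wise difference from a per-cell baseline."""
    return [(m, lat, lon, v - baseline[(lat, lon)]) for m, lat, lon, v in records]

def anomalies(records):
    """Compose two transformations: value minus that cell's temporal mean."""
    return arithmetic_subtract(records, temporal_mean(records))

if __name__ == "__main__":
    for row in anomalies(RECORDS):
        print(row)
```

The two aggregations here play the role that a grouped aggregate would play in a SQL-style query, and `anomalies` illustrates how a more involved analysis can be expressed as a chain of simpler steps.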
