Beating data bottlenecks in weather and climate science

The data volumes produced by simulation and observation are large, and growing rapidly. In the case of simulation, plans for future modelling programmes require complicated orchestration of data and anticipate large user communities. "Download and work at home" is no longer practical for many use-cases. These issues are exacerbated by users who want simulation data at grid-point resolution rather than at the resolution actually resolved by the mathematics, and/or who design numerical experiments without knowledge of the storage costs. There is no simple solution to these problems: user education, smarter compression, better use of tiered storage, and smarter workflows are all necessary, but far from sufficient. In this paper, we introduce two approaches to addressing some of these data bottlenecks: dedicated data analysis platforms, and smarter storage software. We provide a brief introduction to the JASMIN data storage and analysis facility, and to some of the storage tools and approaches being developed by the ESiWACE project. In doing so, we describe some of our observations of real-world data handling problems at scale, from the generic performance of file systems to the difficulty of optimising both stored volume and workflow performance. We use these examples to motivate the two-pronged approach of smarter hardware and smarter software, while recognising that data bottlenecks may yet limit the aspirations of our science.
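To make the storage-cost point concrete, the sketch below gives a back-of-envelope estimate of raw output volume for a nominal global model, comparing output written at full grid-point resolution with output coarsened towards the scales the numerics actually resolve. All grid sizes, variable counts, and output frequencies here are illustrative assumptions, not figures from this paper.

```python
# Back-of-envelope estimate of simulation output volume, illustrating why
# writing data at full grid-point resolution inflates storage costs.
# All parameters below are illustrative assumptions, not figures from the paper.

def output_volume_tb(nx, ny, nlev, nvars, timesteps, bytes_per_value=4):
    """Return the raw (uncompressed) output volume in terabytes."""
    values = nx * ny * nlev * nvars * timesteps
    return values * bytes_per_value / 1e12

# A nominal global grid at ~10 km spacing (~3600 x 1800 points), 85 levels,
# 50 output variables, 6-hourly output for one simulated year.
full_res = output_volume_tb(nx=3600, ny=1800, nlev=85, nvars=50, timesteps=4 * 365)

# The same output coarsened by a factor of 4 in each horizontal direction,
# closer to the scales the numerics actually resolve.
coarsened = output_volume_tb(nx=900, ny=450, nlev=85, nvars=50, timesteps=4 * 365)

print(f"Full grid-point resolution: {full_res:,.1f} TB/year")   # ~161 TB/year
print(f"Coarsened by 4x per dimension: {coarsened:,.1f} TB/year")  # ~10 TB/year
```

Even under these deliberately modest assumptions, a single year of output from one experiment differs by more than an order of magnitude depending on the output resolution chosen, which is why user choices about output grids dominate storage planning.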
