Storing and manipulating environmental big data with JASMIN

JASMIN is a super-data-cluster designed to provide a high-performance, high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflows and to support the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). The initial JASMIN configuration and first experiences are reported here, and useful improvements in scientific workflow are presented. The explosive growth in stored data and usage makes clear that there was pent-up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute capacity, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.
