Evaluation of Big Data Containers for Popular Storage, Retrieval, and Computation Primitives in Earth Science Analysis

Data containers are infrastructures that facilitate the storage, retrieval, and analysis of data sets. Big data applications in Earth Science require a mix of processing techniques, data sources, and storage formats, which are supported by different data containers. The data containers compared in this study are:
• AsterixDB
• Rasdaman
• SciDB
• Hadoop
• HDF

These infrastructures optimize different aspects of the data processing pipeline and are therefore suited to different types of applications. All of these containers are also evolving rapidly, so the ability to re-test them as they evolve is very important for handling large volumes of observational data and model output. We have identified a set of steps that are relevant to most data processing exercises in Earth Science applications, and we evaluate how well each system performs at each of these steps. The steps evaluated in this study are:
• Hardware/software dependencies
• Data ingestion
• Data preparation/processing
• Data analysis
• Result reporting
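
As an illustration of how a per-step, re-runnable comparison of this kind might be organized, the following is a minimal sketch in Python. It is not the harness used in this study. It assumes wall-clock time as the metric, which the abstract does not specify, and the names `STEPS`, `time_steps`, and `fastest_per_step` are hypothetical. Each container would supply its own callable for each step.

```python
import time
from typing import Callable, Dict

# The five pipeline steps named in the abstract, in order.
STEPS = [
    "hardware/software dependencies",
    "data ingestion",
    "data preparation/processing",
    "data analysis",
    "result reporting",
]


def time_steps(step_fns: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each step once, in pipeline order, and return seconds per step.

    Assumption: wall-clock time is the metric of interest; the abstract
    does not say how performance is measured.
    """
    timings: Dict[str, float] = {}
    for step in STEPS:
        start = time.perf_counter()
        step_fns[step]()  # hypothetical container-specific callable
        timings[step] = time.perf_counter() - start
    return timings


def fastest_per_step(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each step to the container with the lowest recorded time."""
    return {
        step: min(results, key=lambda container: results[container][step])
        for step in STEPS
    }
```

Because the harness only depends on one callable per step, it can be re-run unchanged whenever a container releases a new version, which is the re-testing need described above. A step such as hardware/software dependencies may be better recorded qualitatively than timed.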