MiDas: Containerizing Data-Intensive Applications with I/O Specialization

Scientific applications often depend on data produced by computational models, and model-generated data can be prohibitively large. Current mechanisms for sharing and distributing reproducible applications, such as containers, assume that all model data is saved and packaged with a program to support its successful re-execution. However, including model data increases container size, which in turn increases the cost and time required for deployment and further reuse. We present MiDas ("Minimizing Datasets"), a framework for specializing I/O libraries that, given an application, automates the process of identifying and including only the subset of the data that the program accesses. To do this, MiDas combines static and dynamic analysis techniques to map high-level user inputs to low-level file offsets. We show reductions in data size of several orders of magnitude via specialization of I/O libraries associated with model-based data-intensive applications, such as those operating on meteorological and geophysical data.
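To make the idea of mapping high-level inputs to low-level file offsets concrete, the following is a minimal illustrative sketch, not the MiDas implementation. It covers only a dynamic-tracing flavor of the idea and leaves out static analysis entirely. The names `OffsetTracingFile`, `merge_ranges`, and `read_record`, along with the fixed 8-byte record layout, are assumptions invented for this example.

```python
import io


class OffsetTracingFile:
    """Hypothetical wrapper that records every byte range read from a file."""

    def __init__(self, raw):
        self._raw = raw
        self.ranges = []  # half-open (start, end) byte ranges

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def read(self, size=-1):
        start = self._raw.tell()
        data = self._raw.read(size)
        if data:
            self.ranges.append((start, start + len(data)))
        return data


def merge_ranges(ranges):
    """Coalesce overlapping or adjacent half-open ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_record(f, index, record_size=8):
    """Toy application: a high-level record index becomes a low-level offset."""
    f.seek(index * record_size)
    return f.read(record_size)


# Toy "model output": 1000 records of 8 bytes each (8000 bytes in total).
dataset = io.BytesIO(b"".join(i.to_bytes(8, "little") for i in range(1000)))
traced = OffsetTracingFile(dataset)

# The user input (records 3, 4, and 10) determines which offsets are touched.
values = [int.from_bytes(read_record(traced, i), "little") for i in (3, 4, 10)]

needed = merge_ranges(traced.ranges)
subset_bytes = sum(end - start for start, end in needed)
```

In this toy run, only the byte ranges in `needed` would have to ship with the container for the same inputs to replay. A real I/O library adds metadata, chunking, and compression layers that this sketch ignores.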