gcamdata: An R Package for Preparation, Synthesis, and Tracking of Input Data for the GCAM Integrated Human-Earth Systems Model

The increasing data requirements of complex models demand robust, reproducible, and transparent systems to track and prepare their inputs. Here we describe version 1.0 of the gcamdata R package, which processes raw inputs to produce the hundreds of XML files needed by the GCAM integrated human-earth systems model. The package features extensive functional and unit testing as well as data tracing and visualization, and it enforces metadata, documentation, and flexibility in its component data-processing subunits. Although this package is specific to GCAM, many of its structural pieces and approaches should be broadly applicable to, and reusable by, other complex model/data systems aiming to improve transparency, reproducibility, and flexibility.

Funding statement: Primary support for this work was provided by the U.S. Department of Energy, Office of Science, as part of research in Multi-Sector Dynamics, Earth and Environmental System Modeling Program. Additional support was provided by the U.S. Department of Energy Offices of Fossil Energy, Nuclear Energy, and Energy Efficiency and Renewable Energy and the U.S. Environmental Protection Agency.
