Ensuring reliable datasets for environmental models and forecasts

Abstract: At the dawn of the 21st century, environmental scientists are collecting more data more rapidly than at any time in the past. Nowhere is this change more evident than in the advent of sensor networks able to collect and process, in real time, simultaneous measurements over broad areas and at high sampling rates. At the same time there has been great progress in the development of standards, methods, and tools for data analysis and synthesis, including a new standard for descriptive metadata for ecological datasets (Ecological Metadata Language, EML) and new workflow tools that help scientists to assemble datasets and to diagram, record, and execute analyses. However, these developments, important as they are, are not yet sufficient to guarantee the reliability of datasets created by a scientific process, that is, the complex activity that scientists carry out in order to create a dataset. We define a dataset to be reliable when the scientific process used to create it is (1) reproducible and (2) analyzable for potential defects. To address this problem we propose the use of an analytic web, a formal representation of a scientific process that consists of three coordinated graphs (a data-flow graph, a dataset-derivation graph, and a process-derivation graph) originally developed for use in software engineering. An analytic web meets the two key requirements for ensuring dataset reliability: (1) a complete audit trail of all artifacts (e.g., datasets, code, models) used or created in the execution of the scientific process that created the dataset, and (2) detailed process metadata that precisely describe all sub-processes of the scientific process. Construction of such metadata requires the semantic features of a high-level process definition language. In this paper we illustrate the use of an analytic web to represent the scientific process of constructing estimates of ecosystem water flux from data gathered by a complex, real-time multi-sensor network.
We use Little-JIL, a high-level process definition language, to precisely and accurately capture the analytical processes involved. We believe that incorporation of this approach into existing tools and evolving metadata specifications (such as EML) will yield significant benefits to science. These benefits include: complete and accurate representations of scientific processes; support for rigorous evaluation of such processes for logical and statistical errors and for propagation of measurement error; and assurance of dataset reliability for developing sound models and forecasts of environmental change.
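
As a rough illustration of the audit-trail requirement, the dataset-derivation graph can be pictured as a record of which process step produced each dataset from which inputs. The Python sketch below is not Little-JIL and is not the authors' implementation; the class name `DerivationGraph`, its methods, and the dataset and step names are all hypothetical, chosen only to show how a derivation record lets one trace a final dataset back to its sources.

```python
from dataclasses import dataclass, field


@dataclass
class DerivationGraph:
    """Hypothetical, minimal dataset-derivation graph (illustration only)."""

    # Maps each derived artifact to (process step name, list of input artifacts).
    derived_from: dict = field(default_factory=dict)

    def record(self, output, step, inputs):
        """Record that `step` produced `output` from `inputs`."""
        self.derived_from[output] = (step, list(inputs))

    def audit_trail(self, artifact):
        """Return the derivation records behind `artifact`, earliest first."""
        trail, seen = [], set()

        def visit(name):
            # Source artifacts (e.g. raw sensor data) have no derivation record.
            if name in seen or name not in self.derived_from:
                return
            seen.add(name)
            step, inputs = self.derived_from[name]
            for parent in inputs:
                visit(parent)
            trail.append((name, step, inputs))

        visit(artifact)
        return trail


# Hypothetical two-step process: quality control, then flux estimation.
graph = DerivationGraph()
graph.record("cleaned_readings", "quality_control", ["raw_sensor_readings"])
graph.record("water_flux", "estimate_flux", ["cleaned_readings", "raw_meteorology"])

trail = graph.audit_trail("water_flux")
for output, step, inputs in trail:
    print(f"{output} <- {step}({', '.join(inputs)})")
```

A full analytic web would coordinate this graph with the data-flow and process-derivation graphs and would also record the code and models used at each step; the sketch covers only the dataset lineage.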
