Accommodating derived data with an enhanced Core Scientific Metadata model

The Core Scientific MetaData model (CSMD) is used by large scientific facilities to catalogue scientific data. The current version supports experimental scientists who need access to their raw data, facility managers who account for facility usage, and other scientists who wish to re-use raw experimental data. Much of the value of scientific data lies not only in the raw data but also in the analysis of that data to derive published results. A study of the data analysis process in structural science has shown that various data sets derived from the raw data are of use to scientists and should be stored with the raw data. Extensions to the CSMD are presented that describe the analysis process so that the provenance of the derived data can be captured. A pilot implementation incorporating derived data through this extended CSMD model has been trialled with experimental scientists. Remaining challenges to the adoption of the CSMD and the tools it supports are considered.
