A provenance-based approach to manage long term preservation of scientific data

Long-term preservation of scientific data goes beyond the data itself and extends to metadata preservation and curation. While several researchers emphasize curation processes, our work is geared towards assessing the quality of scientific (meta)data. The rationale behind this strategy is that scientific data are often accessed via their metadata, so ensuring metadata quality is a means of providing long-term accessibility. This paper discusses our quality assessment architecture and presents a case study on animal sound recording metadata. The case study illustrates the importance of periodically assessing (meta)data quality: knowledge about the world may evolve, and quality may decrease over time, hampering long-term preservation.
