A provenance-based approach to evaluate data quality in eScience

Data quality is growing in relevance as a research topic. Quality assessment has been progressively incorporated into many business environments and into software engineering practices. In eScience environments, however, the multiplicity and heterogeneity of data sources and of the scientific experts involved in a given problem complicate data quality assessment. This paper deals with the evaluation of the quality of data managed by eScience applications. Our approach is based on data provenance, i.e., the history of the origins of a given data product and of the transformations applied to it. Our contributions include (a) the specification of a framework to track data provenance and use it to derive quality information, (b) a model for data provenance based on the Open Provenance Model, and (c) a methodology to evaluate the quality of data based on its provenance. Our proposal is validated experimentally by a prototype that takes advantage of the Taverna workflow system.
