Data Quality Evaluation of Scientific Datasets - A Case Study in a Policy Support Context

In this work we present the rule-based approach used to evaluate the quality of scientific datasets in a policy support context. The case study refers to real datasets in a context where low data quality limits the accuracy of the analysis results and, consequently, the significance of the provided policy advice. The applied solution consists of identifying types of constraints that can be useful as data quality rules and developing a software tool that evaluates a dataset against a set of rules expressed in XML. As rule types we selected some types of data constraints and dependencies already proposed in data quality works, but we also experimented with the use of order dependencies and existence constraints. The case study was used to develop and test the adopted solution, which is in any case generally applicable to other
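As an illustration of the kind of evaluation the abstract describes, the following is a minimal sketch of checking a dataset against rules written in XML. It is not the tool presented in the paper: the XML element and attribute names (`rules`, `existence`, `order`, `if`, `then`, `lhs`, `rhs`), the function names, and the sample rows are all assumptions invented for this example. It covers only the two rule types the abstract singles out, an existence constraint (if one attribute has a value, another must too) and an order dependency (ordering by one attribute must not invert the order of another).

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: the schema below is invented for this sketch
# and is not the rule format used by the tool described in the paper.
RULES_XML = """
<rules>
  <existence id="r1" if="catch" then="species"/>
  <order id="r2" lhs="length_cm" rhs="weight_kg"/>
</rules>
"""

# Made-up sample rows; None stands for a missing value.
ROWS = [
    {"species": "cod", "catch": 12.0, "length_cm": 40, "weight_kg": 0.8},
    {"species": None,  "catch": 5.0,  "length_cm": 55, "weight_kg": 1.6},
    {"species": "cod", "catch": 7.5,  "length_cm": 60, "weight_kg": 1.2},
]


def check_existence(rows, if_col, then_col):
    """Return indexes of rows where if_col has a value but then_col is missing."""
    return [i for i, r in enumerate(rows)
            if r.get(if_col) is not None and r.get(then_col) is None]


def check_order(rows, lhs, rhs):
    """Return index pairs (i, j) with lhs[i] <= lhs[j] but rhs[i] > rhs[j].

    Rows with a missing value in either column are skipped. The pairwise
    scan is quadratic, which is acceptable for a small illustration.
    """
    complete = [(i, r) for i, r in enumerate(rows)
                if r.get(lhs) is not None and r.get(rhs) is not None]
    return [(i, j)
            for i, a in complete
            for j, b in complete
            if a[lhs] <= b[lhs] and a[rhs] > b[rhs]]


def evaluate(rows, rules_xml):
    """Apply every rule in the XML document and map rule id -> violations."""
    report = {}
    for rule in ET.fromstring(rules_xml):
        if rule.tag == "existence":
            report[rule.get("id")] = check_existence(
                rows, rule.get("if"), rule.get("then"))
        elif rule.tag == "order":
            report[rule.get("id")] = check_order(
                rows, rule.get("lhs"), rule.get("rhs"))
    return report


report = evaluate(ROWS, RULES_XML)
print(report)
```

On the sample rows, rule `r1` flags the second row (a catch value with no species) and rule `r2` flags the pair formed by the second and third rows, where the longer record has the lower weight. Reporting violations per rule, rather than a single pass or fail result, keeps the output usable as a basis for a per-rule quality measure.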
