A Measure-Theoretic Foundation for Data Quality

In this paper, a novel framework for data quality measurement is proposed that adopts a measure-theoretic treatment of the problem. Instead of considering a specific setting in which quality must be assessed, our approach takes the concept of measurement itself as its formal point of departure. The basic assumption of the framework is that the highest possible quality can be described by a set of predicates. The quality of data is then measured by evaluating those predicates and combining their evaluations. This combination is based on a capacity function (i.e., a fuzzy measure) that models, for each combination of predicates, its contribution to the quality of the data. It is shown that expressing quality on an ordinal scale yields a high degree of interpretability and a compact representation of the measurement function. Within this purely ordinal framework for measurement, it is shown that reasoning about quality beyond the ordinal level arises naturally from uncertainty about predicate evaluation. The position of the proposed framework relative to other approaches is discussed, with particular attention to the aggregation of measurements. The practical usability of the framework is illustrated for several well-known dimensions of data quality and demonstrated in a case study on clinical trials.
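As a minimal illustration of the kind of aggregation described above, the sketch below combines ordinal predicate evaluations with a capacity (fuzzy measure) using a Sugeno integral, a standard ordinal aggregation operator. The predicate names, grades, and capacity values are hypothetical, chosen only to make the example concrete; they are not taken from the paper.

```python
def sugeno_integral(scores, capacity):
    """Ordinal aggregation of predicate evaluations with a capacity.

    scores:   dict mapping each quality predicate to an ordinal grade in [0, 1]
    capacity: dict mapping each frozenset of predicates to its grade in [0, 1];
              must be monotone, with capacity[frozenset()] == 0 and the full
              predicate set mapped to 1 (the highest possible quality).
    """
    # Sort predicates by grade, highest first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = 0.0
    selected = set()
    for predicate, grade in ranked:
        # A_i: the i predicates with the highest grades so far.
        selected.add(predicate)
        # Sugeno integral: max over i of min(grade_i, capacity(A_i)).
        best = max(best, min(grade, capacity[frozenset(selected)]))
    return best


# Hypothetical capacity over two predicates: each alone carries grade 0.5,
# and only both together reach the highest quality level 1.0.
capacity = {
    frozenset(): 0.0,
    frozenset({"completeness"}): 0.5,
    frozenset({"consistency"}): 0.5,
    frozenset({"completeness", "consistency"}): 1.0,
}

quality = sugeno_integral({"completeness": 0.8, "consistency": 0.4}, capacity)
# quality == 0.5: completeness alone is capped by its capacity (0.5), and
# adding consistency is capped by its lower grade (0.4).
```

Because the Sugeno integral uses only `min` and `max`, the same computation goes through on any ordinal scale, which matches the paper's emphasis on purely ordinal measurement.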
