Dependency Discovery in Data Quality

A conceptual framework for the automatic discovery of dependencies between data quality dimensions is described. Dependency discovery consists of recovering the dependency structure among a set of data quality dimensions measured on the attributes of a database. This task is carried out with a data mining methodology: a Bayesian Network is learned from the database and then used to analyze dependencies between data quality dimensions associated with different attributes. The proposed framework is instantiated on a real-world database, with dependency discovery presented for three data quality dimensions: accuracy, completeness, and consistency. The resulting Bayesian Network model shows how data quality can be improved while satisfying budget constraints.
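The abstract does not say which structure-learning algorithm is used, so the following is only a minimal sketch of one ingredient such a procedure might rely on: a pairwise dependency score between quality indicators. It assumes completeness is measured as a per-record 0/1 flag for each attribute and uses empirical mutual information as the score. The function names, the sample records, and the choice of mutual information are illustrative assumptions, not the authors' method.

```python
from collections import Counter
from math import log


def completeness_flags(rows, attribute):
    """Return 1 for each record where `attribute` is present, else 0.

    Assumption: a value counts as missing when it is None or an empty string.
    """
    return [0 if row.get(attribute) in (None, "") else 1 for row in rows]


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences.

    A value near 0 suggests the two indicators are independent in the sample;
    larger values suggest a dependency worth modelling as an edge.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log(count * n / (px[x] * py[y]))
    return mi


# Hypothetical records: city and zip go missing together, phone does not.
rows = [
    {"city": "Milan", "zip": "20100", "phone": "555-0100"},
    {"city": None, "zip": None, "phone": "555-0101"},
    {"city": "Rome", "zip": "00100", "phone": None},
    {"city": "", "zip": "", "phone": None},
]

city = completeness_flags(rows, "city")
zip_code = completeness_flags(rows, "zip")
phone = completeness_flags(rows, "phone")

mi_city_zip = mutual_information(city, zip_code)
mi_city_phone = mutual_information(city, phone)
```

In this toy sample the completeness of city and zip is perfectly coupled, so their score equals the entropy of a balanced binary variable, while city and phone score zero. A full Bayesian Network learner would combine such evidence across all attribute and dimension pairs and also orient the edges, which this sketch does not attempt.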
