Determining the relative accuracy of attributes

The relative accuracy problem is to determine, given tuples <i>t</i><sub>1</sub> and <i>t</i><sub>2</sub> that refer to the same entity <i>e</i>, whether <i>t</i><sub>1</sub>[<i>A</i>] is more accurate than <i>t</i><sub>2</sub>[<i>A</i>], i.e., whether <i>t</i><sub>1</sub>[<i>A</i>] is closer to the true value of the <i>A</i> attribute of <i>e</i> than <i>t</i><sub>2</sub>[<i>A</i>] is. This is a longstanding issue in data quality, and it is challenging when the true values of <i>e</i> are unknown. This paper proposes a model for determining relative accuracy. (1) We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy. (2) We identify and study several fundamental problems for relative accuracy. Given a set <i>I</i><sub>e</sub> of tuples pertaining to the same entity <i>e</i> and a set of accuracy rules, these problems are to decide whether the chase process terminates, whether it is Church-Rosser, and whether it leads to a unique target tuple <i>t</i><sub>e</sub> composed of the most accurate values from <i>I</i><sub>e</sub> for all the attributes of <i>e</i>. (3) We propose a framework for inferring accurate values with user interaction. (4) We provide algorithms underlying the framework that find the unique target tuple <i>t</i><sub>e</sub> whenever possible; when there is not enough information to determine a complete <i>t</i><sub>e</sub>, we compute top-<i>k</i> candidate targets based on a preference model. (5) Using real-life and synthetic data, we experimentally verify the effectiveness and efficiency of our method.
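To make the idea of deducing a target tuple from pairwise relative accuracy more concrete, the following is a toy sketch in Python. It is not the paper's rule language or chase procedure. The rule format (an attribute paired with a predicate over two tuples), the function names `deduce_orders` and `target_tuple`, and the sample data are all assumptions introduced for illustration. Rules are applied to every ordered pair of tuples, the resulting per-attribute order is closed under transitivity, and an attribute receives a target value only if one tuple's value dominates every differing value without conflict.

```python
from itertools import permutations

def deduce_orders(tuples, rules):
    """For each attribute, collect pairs (i, j) meaning: tuple j is deduced
    to be more accurate than tuple i on that attribute (hypothetical format)."""
    order = {attr: set() for attr in tuples[0]}
    # Apply every rule to every ordered pair of distinct tuples.
    for i, j in permutations(range(len(tuples)), 2):
        for attr, premise in rules:
            if premise(tuples[i], tuples[j]):
                order[attr].add((i, j))
    # Close each per-attribute order under transitivity until a fixpoint.
    for pairs in order.values():
        changed = True
        while changed:
            changed = False
            for a, b in list(pairs):
                for c, d in list(pairs):
                    if b == c and a != d and (a, d) not in pairs:
                        pairs.add((a, d))
                        changed = True
    return order

def target_tuple(tuples, order):
    """Pick, per attribute, a value that dominates all differing values.
    An attribute stays None when no such value can be deduced."""
    target = {}
    for attr, pairs in order.items():
        target[attr] = None
        for j, tj in enumerate(tuples):
            dominates = all(
                ti[attr] == tj[attr]
                or ((i, j) in pairs and (j, i) not in pairs)
                for i, ti in enumerate(tuples)
            )
            if dominates:
                target[attr] = tj[attr]
                break
    return target

# Hypothetical tuples that refer to the same entity.
tuples = [
    {"season": 2001, "points": 1200, "team": "Wizards"},
    {"season": 2002, "points": 1500, "team": "Wizards"},
    {"season": 2002, "points": None, "team": "Bulls"},
]

# Hypothetical accuracy rules: (attribute, premise over an ordered pair).
rules = [
    ("season", lambda a, b: a["season"] < b["season"]),
    ("points", lambda a, b: a["season"] < b["season"] and b["points"] is not None),
    ("points", lambda a, b: a["points"] is None and b["points"] is not None),
]

target = target_tuple(tuples, deduce_orders(tuples, rules))
print(target)
```

In this sketch no rule orders the `team` values, so that attribute is left undetermined. This corresponds loosely to the situation the abstract describes, in which a complete target tuple cannot be decided and candidate targets or user input are needed.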
