Evaluating entity-description conflict on duplicated data

Duplicated records, which describe the same real-world entity, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. In practice, however, they may carry conflicting values because of ambiguity and data errors. The more conflicts there are among the duplicated records in a data set, the poorer the quality of that data set. To quantify this, we explore a new data quality measure, entity-description conflict, which evaluates the conflict among duplicated records. Because current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. This paper therefore studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) The mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard, and thus the problem of computing the range of the entity-description conflict is proved to be NP-hard. (3) An approximation algorithm is provided for each of the four primary operators, and a framework based on these operators is proposed for computing the range of the entity-description conflict. (4) Experiments on real-life and synthetic data verify the effectiveness and efficiency of the proposed algorithms.
