Indeterministic Handling of Uncertain Decisions in Duplicate Detection

In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared duplicates or not. Most often, however, it is not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach that handles uncertain decisions in a duplicate detection process by using a probabilistic target schema. Instead of deciding between multiple possible worlds, all of these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in a meaningful way.
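
To make the cost argument concrete, the following is a minimal sketch, not the method of the paper. It assumes that each tuple pair carries a similarity score in [0, 1], that two hypothetical thresholds `t_low` and `t_high` separate clear decisions from uncertain ones, that the score of an uncertain pair can be read as its duplicate probability, and that pair decisions are independent (which ignores transitivity between pairs). Under these assumptions, k uncertain pairs yield 2^k possible worlds, so narrowing the uncertain band is one simple way to reduce the set of indeterministically handled decisions.

```python
from itertools import product


def classify(pairs, t_low, t_high):
    """Split scored pairs into matches, non-matches, and uncertain pairs.

    t_low and t_high are illustrative thresholds, not values from the paper.
    """
    matches, non_matches, uncertain = [], [], []
    for pair, score in pairs.items():
        if score >= t_high:
            matches.append(pair)
        elif score <= t_low:
            non_matches.append(pair)
        else:
            uncertain.append(pair)
    return matches, non_matches, uncertain


def possible_worlds(pairs, uncertain):
    """Enumerate every combination of decisions for the uncertain pairs.

    Assumes independent decisions and treats the similarity score as the
    probability that a pair is a duplicate. Returns (decisions, probability)
    tuples; their number is 2 ** len(uncertain).
    """
    worlds = []
    for choice in product([True, False], repeat=len(uncertain)):
        prob = 1.0
        for pair, is_dup in zip(uncertain, choice):
            score = pairs[pair]
            prob *= score if is_dup else 1.0 - score
        worlds.append((dict(zip(uncertain, choice)), prob))
    return worlds


# Hypothetical similarity scores for four tuple pairs.
pairs = {
    ("t1", "t2"): 0.95,
    ("t3", "t4"): 0.60,
    ("t5", "t6"): 0.10,
    ("t7", "t8"): 0.50,
}

matches, non_matches, uncertain = classify(pairs, t_low=0.2, t_high=0.9)
worlds = possible_worlds(pairs, uncertain)

# Narrowing the uncertain band leaves fewer decisions open.
_, _, fewer_uncertain = classify(pairs, t_low=0.55, t_high=0.9)
fewer_worlds = possible_worlds(pairs, fewer_uncertain)
```

With the wider band, two pairs stay uncertain and four worlds are kept; with the narrower band, only one pair stays uncertain and two worlds remain. The price is that the pair scored 0.50 is then decided deterministically.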
