Entity resolution for uncertain data

Entity resolution (ER), also known as duplicate detection or record matching, is the problem of identifying the tuples that represent the same real-world entity. In this paper, we address the problem of ER for uncertain data, which we call ERUD. We propose two approaches to the ERUD problem, one for each of two classes of similarity functions: context-free and context-sensitive. For context-free similarity functions we propose a PTIME algorithm, and for context-sensitive similarity functions a Monte Carlo algorithm. Existing context-sensitive similarity functions need at least one pass over the database to compute statistical features of the data, which makes them very inefficient when used within our Monte Carlo algorithm. We therefore propose a novel context-sensitive similarity function that makes the Monte Carlo algorithm more efficient. To improve its efficiency further, we also propose a parallel version based on the MapReduce framework. We validated our algorithms through experiments over both synthetic and real datasets. Our performance evaluation shows the effectiveness of our algorithms in terms of success rate and response time.
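
The Monte Carlo idea can be illustrated with a minimal sketch. It is not the algorithm of the paper: the data model (each uncertain tuple as a list of mutually exclusive alternatives with probabilities), the inverse-frequency token weighting, and all function names are assumptions made for illustration only. The sketch samples possible worlds, recomputes a simple context-sensitive similarity in each one, and estimates how often each tuple is the best match for a given entity. The per-world pass in `token_weights` is the kind of cost the abstract attributes to existing context-sensitive similarity functions.

```python
import random
from collections import Counter

def sample_world(xtuples, rng):
    """Draw one possible world: pick at most one alternative per uncertain tuple.

    Assumed model: probabilities of a tuple's alternatives sum to at most 1;
    any remainder is the chance that the tuple is absent from the world.
    """
    world = {}
    for tid, alternatives in xtuples.items():
        r = rng.random()
        acc = 0.0
        for value, p in alternatives:
            acc += p
            if r < acc:
                world[tid] = value
                break
    return world

def token_weights(world):
    """Context-sensitive statistics: tokens that are rarer in this world weigh more.

    This requires a full pass over the sampled world (illustrative weighting).
    """
    df = Counter()
    for value in world.values():
        df.update(set(value.lower().split()))
    return {tok: 1.0 / count for tok, count in df.items()}

def similarity(a, b, weights):
    """Weighted Jaccard similarity over whitespace tokens (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    shared = sum(weights.get(t, 1.0) for t in ta & tb)
    total = sum(weights.get(t, 1.0) for t in ta | tb)
    return shared / total if total else 0.0

def estimate_match_probabilities(xtuples, entity, samples=2000, seed=0):
    """Estimate, per tuple, the fraction of sampled worlds in which it is the best match."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(samples):
        world = sample_world(xtuples, rng)
        if not world:
            continue
        weights = token_weights(world)
        scores = {tid: similarity(v, entity, weights) for tid, v in world.items()}
        best = max(sorted(scores), key=scores.get)
        if scores[best] > 0:  # a world with no similar tuple yields no match
            wins[best] += 1
    return {tid: wins[tid] / samples for tid in xtuples}

# Hypothetical toy data, not taken from the paper.
xtuples = {
    "t1": [("john smith", 0.9), ("jon smyth", 0.1)],
    "t2": [("mary jones", 1.0)],
}
probs = estimate_match_probabilities(xtuples, "john smith")
```

In this toy input, `t1` matches the entity only in worlds where its first alternative is chosen, so its estimate should be close to 0.9, while `t2` never matches. Because every sample is processed independently, the loop over samples is also the natural unit to distribute across workers, which is consistent with the parallel version mentioned in the abstract.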
