Probabilistic estimates of attribute statistics and match likelihood for people entity resolution

For big data practitioners, data integration, also known as entity resolution or record linkage, is one of the key challenges we face from day to day. Achieving high precision and recall on a large graph with billions of nodes and hundreds of times more edges poses significant scalability challenges. Similarity-based graph partitioning is still the most scalable method available. This paper presents a probabilistic method to approximate the match likelihood of a pair of records by incorporating the values of different attributes and their aggregates/statistics. The quality of the approximations depends on the accuracy of the estimates of the aggregated values. The paper adapts the GTM model described in [1] to obtain these estimates. We present experimental results based on real-world commercial data sources to show that the estimates obtained via the GTM model are better than the baseline. Our experimental results also show that the approximate match likelihood can improve the recall of the similarity function.
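To make the idea of combining attribute values and their aggregates into a match likelihood concrete, here is a minimal sketch. It is not the paper's adapted GTM model: it assumes a simple inverse-variance weighted mean as a stand-in for the aggregate estimate, and a naive-Bayes (Fellegi-Sunter style) log-odds score for the match likelihood. All function names, the `m_probs`/`u_probs` parameters, and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
import math


def weighted_truth_estimate(claims):
    """Estimate a real-valued attribute from conflicting claims.

    claims: list of (value, source_variance) pairs. Sources with a
    smaller assumed variance get a larger weight (inverse-variance mean).
    This is an illustrative stand-in, not the GTM model itself.
    """
    weights = [1.0 / variance for _, variance in claims]
    total = sum(weights)
    return sum(w * value for w, (value, _) in zip(weights, claims)) / total


def match_log_odds(agreements, m_probs, u_probs, prior_log_odds=0.0):
    """Sum per-attribute log-likelihood ratios for a pair of records.

    agreements: dict attribute -> bool (do the two records agree?)
    m_probs[a]: assumed P(agree on a | records match)
    u_probs[a]: assumed P(agree on a | records do not match)
    """
    score = prior_log_odds
    for attribute, agrees in agreements.items():
        m, u = m_probs[attribute], u_probs[attribute]
        if agrees:
            score += math.log(m / u)
        else:
            score += math.log((1.0 - m) / (1.0 - u))
    return score


def match_probability(log_odds):
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

As a usage example under the same assumptions, an age attribute could be estimated per record with `weighted_truth_estimate([(30.0, 1.0), (40.0, 4.0)])`, two records could be marked as agreeing on age when their estimates fall within a chosen tolerance, and that agreement flag could be passed to `match_log_odds` together with name and address agreements.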

[1] Hakan Kardes, et al. Graph-based Approaches for Organization Entity Resolution in MapReduce, 2013, TextGraphs@EMNLP.

[2] Tarek F. Abdelzaher, et al. On truth discovery in social sensing: A maximum likelihood estimation approach, 2012, International Symposium on Information Processing in Sensor Networks.

[3] Lise Getoor, et al. Collective entity resolution in relational data, 2007, TKDD.

[4] Charu C. Aggarwal, et al. Recursive Fact-Finding: A Streaming Approach to Truth Estimation in Crowdsourcing Applications, 2013, 2013 IEEE 33rd International Conference on Distributed Computing Systems.

[5] Ivan P. Fellegi, et al. A Theory for Record Linkage, 1969.

[6] Jiawei Han, et al. A Probabilistic Model for Estimating Real-valued Truth from Conflicting Sources, 2012.

[7] Jayant Madhavan, et al. Reference reconciliation in complex information spaces, 2005, SIGMOD '05.

[8] Xiaoxin Yin, et al. Semi-supervised truth discovery, 2011, WWW.

[9] Andrew Borthwick, et al. Dynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce, 2012.

[10] Xin Wang, et al. CCF: Fast and scalable connected component computation in MapReduce, 2014, 2014 International Conference on Computing, Networking and Communications (ICNC).

[11] Bo Zhao, et al. A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration, 2012, Proc. VLDB Endow.

[12] Sheng Chen. The Case for Cost-Sensitive and Easy-To-Interpret Models in Industrial Record Linkage, 2011.

[13] Ahmed K. Elmagarmid, et al. Duplicate Record Detection: A Survey, 2007, IEEE Transactions on Knowledge and Data Engineering.

[14] Divesh Srivastava, et al. Data Fusion: Resolving Conflicts from Multiple Sources, 2013, WAIM.

[15] Divesh Srivastava, et al. Big Data Integration, 2015, Synthesis Lectures on Data Management.

[16] Mitul Tiwari, et al. Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach, 2013, Proc. VLDB Endow.

[17] Salvatore J. Stolfo, et al. The merge/purge problem for large databases, 1995, SIGMOD '95.

[19] Mikhail Bilenko and Raymond J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection, 2003.

[20] Peter Fankhauser, et al. Efficient entity resolution for large heterogeneous information spaces, 2011, WSDM '11.

[21] Divesh Srivastava, et al. Truth Finding on the Deep Web: Is the Problem Solved?, 2012, Proc. VLDB Endow.

[22] Divesh Srivastava, et al. Less is More: Selecting Sources Wisely for Integration, 2012, Proc. VLDB Endow.

[23] Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[24] W. Winkler. Overview of Record Linkage and Current Research Directions, 2006.

[25] Divesh Srivastava, et al. Integrating Conflicting Data: The Role of Source Dependence, 2009, Proc. VLDB Endow.