Incremental Entity Resolution from Linked Documents

In many government applications, information about entities such as persons is spread across disparate data sources such as passports, driving licences, bank accounts, and income-tax records. Similar scenarios are commonplace in large enterprises that maintain multiple customer, supplier, or partner databases. Each data source records different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide their driving-licence number when applying for a passport, or vice versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm that clusters documents highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique combines iterative graph traversal, locality-sensitive hashing, iterative match-merge, and graph clustering to discover unique entities from a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally, with only a small subset of the previously resolved entity-document collection needing to be re-resolved. We present performance and quality results on two datasets: a real-world database of companies and a large synthetically generated 'population' database. We also demonstrate the benefit of using inter-document references for clustering, in the form of enhanced recall of documents for resolution.
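The general idea of combining inter-document references with attribute similarity can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each document is a single attribute string, treats explicit references as undirected links, uses MinHash banding as one possible form of locality-sensitive hashing, and replaces match-merge and graph clustering with a simple union-find. All names (`resolve`, `shingles`, `minhash`, `UnionFind`) and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lower-cased attribute string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash(tokens, num_hashes):
    """MinHash signature: the minimum seeded MD5 value per hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    ]


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0


class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def resolve(docs, links, bands=16, rows=2, threshold=0.5):
    """Cluster documents using explicit links plus LSH-blocked similarity.

    docs:  {doc_id: attribute string}
    links: iterable of (doc_id, doc_id) inter-document references
    """
    uf = UnionFind(docs)

    # 1. Documents that reference each other are placed in the same cluster.
    for a, b in links:
        uf.union(a, b)

    # 2. LSH banding proposes candidate pairs; exact Jaccard confirms them.
    sets = {d: shingles(text) for d, text in docs.items()}
    buckets = defaultdict(set)
    for d, s in sets.items():
        sig = minhash(s, bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(d)
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if jaccard(sets[a], sets[b]) >= threshold:
                uf.union(a, b)

    # 3. Each connected component is reported as one entity.
    clusters = defaultdict(set)
    for d in docs:
        clusters[uf.find(d)].add(d)
    return sorted(sorted(c) for c in clusters.values())


docs = {
    "passport:1": "john a smith 1980-01-02",
    "licence:7": "john a smith 1980-01-02",
    "tax:3": "smith j.",
    "bank:9": "maria gonzales 1975-11-30",
}
links = [("tax:3", "passport:1")]  # the tax record cites the passport number
entities = resolve(docs, links)
```

In this toy input the tax record shares too little text with the passport to be matched on attributes alone, so it joins the cluster only through its reference. That is the kind of recall gain the abstract attributes to inter-document links.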
