Incremental entity fusion from linked documents

In many government applications, especially in intelligence and law enforcement, information about entities such as persons or companies is spread across disparate data sources. For example, information about an individual may be distributed across passports, driving licences, bank accounts, and income tax documents, all of which need to be resolved and fused to reveal a consolidated profile. In this paper we describe an algorithm that fuses documents highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique combines iterative graph traversal, locality-sensitive hashing, iterative match-merge, and graph clustering to discover the unique entities in a document corpus. Further, new sets of documents can be added incrementally, with only a small subset of the previously fused entity-document collection needing to be re-processed. We present performance and quality results using both Bayesian likelihood fusion and Support Vector Machines to demonstrate the benefit of using inter-document references, both for improving accuracy and for detecting attempts at deliberate obfuscation.
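
To make the interplay of attribute similarity and inter-document references concrete, the following is a minimal illustrative sketch and not the algorithm evaluated in the paper. It assumes a hypothetical document layout (a `name` attribute and a `refs` set of referenced document ids) and hypothetical parameters `threshold` and `ref_bonus`. Candidate pairs come from MinHash-based locality-sensitive hashing over character shingles plus explicit reference edges; pairs are scored by Jaccard similarity with a bonus for a reference, and merged with union-find as a simple stand-in for match-merge and graph clustering. Incremental addition of documents is not covered.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lower-cased string."""
    t = text.lower()
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


def minhash(sh, num_hashes):
    """MinHash signature: per seed, the smallest seeded hash over the shingles."""
    def h(seed, s):
        digest = hashlib.md5(f"{seed}:{s}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(seed, s) for s in sh) for seed in range(num_hashes)]


def lsh_candidates(docs, bands=8, rows=2):
    """Pairs of documents whose signatures agree on at least one band."""
    buckets = defaultdict(list)
    for doc_id, doc in docs.items():
        sig = minhash(shingles(doc["name"]), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def fuse(docs, threshold=0.6, ref_bonus=0.3):
    """Cluster documents into entities; threshold and ref_bonus are assumed values."""
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Reference edges are always candidates, even when the attributes look dissimilar.
    ref_pairs = {tuple(sorted((d, r))) for d, doc in docs.items()
                 for r in doc["refs"] if r in docs and r != d}
    for a, b in lsh_candidates(docs) | ref_pairs:
        sa, sb = shingles(docs[a]["name"]), shingles(docs[b]["name"])
        score = len(sa & sb) / len(sa | sb)  # Jaccard similarity
        if (a, b) in ref_pairs:
            score += ref_bonus
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for d in docs:
        clusters[find(d)].add(d)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy corpus: the bank account misspells the name but cites the passport.
docs = {
    "passport-1": {"name": "John Smith", "refs": set()},
    "licence-1": {"name": "John Smith", "refs": set()},
    "bank-1": {"name": "Jon Smith", "refs": {"passport-1"}},
    "tax-1": {"name": "Alice Wong", "refs": set()},
}
entities = fuse(docs)
```

In this toy corpus the misspelled bank record falls below the similarity threshold on attributes alone, so it joins the passport's entity only because of the reference; removing the reference leaves it as a separate entity.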
