HARRA: fast iterative hashed record linkage for large-scale data collections

We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose the Iterative Locality-Sensitive Hashing (ILSH) that dynamically merges LSH-based has tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named as HARRA (<u>HA</u>shed <u>R</u>eco<u>R</u>d link<u>A</u>ge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).

[1]  Craig A. Knoblock,et al.  Learning Blocking Schemes for Record Linkage , 2006, AAAI.

[2]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[3]  Byung-Won On,et al.  Comparative study of name disambiguation problem using a scalable blocking-based framework , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[4]  Edith Cohen,et al.  Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[5]  S. Russell Identity Uncertainty , 2010, Encyclopedia of Machine Learning.

[6]  Lise Getoor,et al.  Collective entity resolution in relational data , 2007, TKDD.

[7]  Dongwon Lee,et al.  Parallel linkage , 2007, CIKM '07.

[8]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.

[9]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[10]  Alexandr Andoni,et al.  Efficient algorithms for substring near neighbor problem , 2006, SODA '06.

[11]  Edith Cohen,et al.  Finding Interesting Associations without Support Pruning , 2001, IEEE Trans. Knowl. Data Eng..

[12]  Divesh Srivastava,et al.  Group Linkage , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[13]  Peter Christen,et al.  Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface , 2008, KDD.

[14]  Xin Li,et al.  Constraint-Based Entity Matching , 2005, AAAI.

[15]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[16]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[17]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[18]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[19]  Sunita Sarawagi,et al.  Efficient set joins on similarity predicates , 2004, SIGMOD '04.

[20]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[21]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[22]  Raymond J. Mooney,et al.  Adaptive Blocking: Learning to Scale Up Record Linkage , 2006, Sixth International Conference on Data Mining (ICDM'06).

[23]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[24]  Hector Garcia-Molina,et al.  Generic Entity Resolution with Data Confidences , 2006, CleanDB.

[25]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[26]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[27]  Chen Li,et al.  Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques , 2006, World Wide Web.

[28]  Dmitri V. Kalashnikov,et al.  Exploiting Relationships for Domain-Independent Data Cleaning , 2005, SDM.

[29]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[30]  Panagiotis Papapetrou,et al.  Nearest Neighbor Retrieval Using Distance-Based Hashing , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[31]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[32]  Luis Gravano,et al.  Text joins in an RDBMS for web data integration , 2003, WWW '03.

[33]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[34]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.