Unsupervised Blocking Key Selection for Real-Time Entity Resolution

Real-time entity resolution (ER) is the process of matching query records, in sub-second time, against the records in a database that represent the same real-world entity. Indexing is a major step in the ER process: it reduces the search space by bringing similar records closer to each other according to a blocking key criterion. Selecting these keys is crucial for both the effectiveness and the efficiency of real-time ER. Traditional indexing techniques rely on domain knowledge for optimal key selection; making the ER process less dependent on human domain knowledge therefore requires automatic selection of optimal blocking keys. In this paper, we propose an unsupervised learning technique that automatically selects optimal blocking keys for building indexes that can be used in real-time ER. Specifically, we learn multiple keys for use with multi-pass sorted neighbourhood, one of the most efficient and widely used indexing techniques for ER. We evaluate the proposed approach on three real-world data sets and compare it with an existing automatic blocking key selection technique. The results show that our approach learns optimal blocking/sorting keys that are suitable for real-time ER. The learnt keys significantly increase the efficiency of query matching while maintaining the quality of matching results.
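
To make the indexing step concrete, the following is a minimal Python sketch of multi-pass sorted neighbourhood with hand-picked sorting keys. It is not the key selection technique proposed in the paper. The record fields, the two key functions, the window size, and all function names are assumptions chosen purely for illustration. Each pass sorts the records by one key and pairs every record with the neighbours inside a sliding window; the candidate set is the union over all passes.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]
KeyFunc = Callable[[Record], str]
Pair = Tuple[int, int]


def sorted_neighbourhood_pairs(
    records: Dict[int, Record], key_func: KeyFunc, window: int
) -> Set[Pair]:
    """One pass: sort record ids by the key, then pair each record with
    the (window - 1) records that follow it in sorted order."""
    order = sorted(records, key=lambda rid: key_func(records[rid]))
    pairs: Set[Pair] = set()
    for i, rid in enumerate(order):
        for other in order[i + 1 : i + window]:
            pairs.add((min(rid, other), max(rid, other)))
    return pairs


def multi_pass_candidates(
    records: Dict[int, Record], key_funcs: List[KeyFunc], window: int
) -> Set[Pair]:
    """Union of the candidate pairs produced by one pass per sorting key."""
    candidates: Set[Pair] = set()
    for key_func in key_funcs:
        candidates |= sorted_neighbourhood_pairs(records, key_func, window)
    return candidates


# Hypothetical toy data; field names and values are illustrative only.
records = {
    1: {"first": "peter", "last": "smith", "city": "sydney"},
    2: {"first": "pete", "last": "smyth", "city": "sydney"},
    3: {"first": "anna", "last": "jones", "city": "perth"},
    4: {"first": "ana", "last": "jones", "city": "perth"},
    5: {"first": "mary", "last": "brown", "city": "hobart"},
}

# Hand-picked sorting keys (assumed); the paper learns such keys automatically.
key_funcs = [
    lambda r: r["last"] + r["first"],
    lambda r: r["city"] + r["first"],
]

candidates = multi_pass_candidates(records, key_funcs, window=2)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(candidates)} candidate pairs instead of {all_pairs}")
```

In this toy run, both likely duplicate pairs, (1, 2) and (3, 4), end up in the candidate set, while only half of all possible pairs need a detailed comparison. Which keys are used determines both how many true matches fall inside a window and how many comparisons are made, which is why key selection matters for real-time ER.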
