Fast Record Linkage for Company Entities

Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data from different sources. Typically, no common key is available for connecting records, so extensive data integration often has to be completed before any analytics or further processing can be performed. In this work we focus on company entity matching, where company name, location and industry are taken into account. Our contribution is a highly scalable, enterprise-grade end-to-end system that combines rule-based linkage algorithms with a machine learning approach that accounts for short company names. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. On real-world ground-truth datasets, we show that our approach reaches a recall of 91%, compared to 73% for baseline approaches, while scaling linearly with the number of nodes used in the system.
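
To illustrate how MinHash can decompose the search space, the following is a minimal sketch of MinHash signatures combined with banded bucketing, so that only records sharing a bucket are compared. It is not the system described above: the character 3-gram tokenization, the 64 hash functions, the 16 bands, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(name, k=3):
    """Character k-grams of a lowercased, whitespace-normalized name (assumed tokenization)."""
    s = " ".join(name.lower().split())
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, parameterized by a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    """One minimum per seeded hash function; similar sets tend to share minima."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(names, num_hashes=64, bands=16):
    """Return index pairs (i, j), i < j, whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, name in enumerate(names):
        sig = minhash_signature(shingles(name), num_hashes)
        for b in range(bands):
            # Records with an identical band slice fall into the same bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


if __name__ == "__main__":
    names = ["Acme Corp", "acme   corp", "Zenith Holdings"]
    print(sorted(candidate_pairs(names)))
```

Only the candidate pairs returned by the bucketing step would then be passed to a more expensive comparison on name, location and industry. More bands with fewer rows each admit more candidates (higher recall, more comparisons), while fewer bands with more rows do the opposite.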
