Deep learning based approach to unstructured record linkage

Purpose In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is a task of identifying and linking records from multiple sources that describe the same real world object (e.g. person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on the comparison of their corresponding values. Most of the existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing amount of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data. Design/methodology/approach In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work originally presented at iiWAS2020 [16] by exploring new architectures of the Siamese Neural Network, which improves the generalisation of the RL model and makes it less sensitive to parameter selection. Findings The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results in comparison to the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020). Better results have been achieved in three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of the parameter selection. Originality/value To address the problem of unstructured RL, this paper presents a new deep learning based approach to improve the generalisation of the Siamese Multilayer Preceptron model and make is less sensitive to parameter selection.

[1]  Lei Le,et al.  Supervised autoencoders: Improving generalization performance with unsupervised regularizers , 2018, NeurIPS.

[2]  Kevin O'Hare,et al.  High-Value Token-Blocking: Efficient Blocking Method for Record Linkage , 2021, ACM Trans. Knowl. Discov. Data.

[3]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[4]  Divesh Srivastava,et al.  Big data integration , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[5]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[6]  Gregory R. Koch,et al.  Siamese Neural Networks for One-Shot Image Recognition , 2015 .

[7]  Weiru Liu,et al.  A novel ensemble learning approach to unsupervised record linkage , 2017, Inf. Syst..

[8]  Mirella Lapata,et al.  Composition in Distributional Models of Semantics , 2010, Cogn. Sci..

[9]  Jens Lehmann,et al.  Wombat - A Generalization Approach for Automatic Link Discovery , 2017, ESWC.

[10]  Jonas Mueller,et al.  Siamese Recurrent Architectures for Learning Sentence Similarity , 2016, AAAI.

[11]  Axel-Cyrille Ngonga Ngomo,et al.  Unsupervised learning of link specifications: deterministic vs. non-deterministic , 2013, OM.

[12]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[13]  Robert Isele,et al.  Learning Expressive Linkage Rules using Genetic Programming , 2012, Proc. VLDB Endow..

[14]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[15]  Daniel P. Miranker,et al.  Semi-supervised Instance Matching Using Boosted Classifiers , 2015, ESWC.

[16]  Ashwin Machanavajjhala,et al.  Entity Resolution: Theory, Practice & Open Challenges , 2012, Proc. VLDB Endow..

[17]  Shafiq R. Joty,et al.  DeepER - Deep Entity Resolution , 2017, ArXiv.

[18]  William E. Winkler,et al.  String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. , 1990 .

[19]  Daniel P. Miranker,et al.  An Unsupervised Algorithm for Learning Blocking Schemes , 2013, 2013 IEEE 13th International Conference on Data Mining.

[20]  Matthew A. Jaro,et al.  Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida , 1989 .

[21]  Claude E. Shannon,et al.  A mathematical theory of communication , 1948, MOCO.

[22]  Qing Wang,et al.  Efficient Interactive Training Selection for Large-Scale Entity Resolution , 2015, PAKDD.

[23]  Jeffrey Xu Yu,et al.  Entity Matching: How Similar Is Similar , 2011, Proc. VLDB Endow..

[24]  Arjun Mukherjee,et al.  Leveraging Social Media Signals for Record Linkage , 2018, WWW.

[25]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[26]  Pradeep Ravikumar,et al.  Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[27]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[28]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[29]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[30]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[31]  P. Jaccard Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines , 1901 .

[32]  William W. Cohen,et al.  A Comparison of String Metrics for Matching Names and Records , 2003 .

[33]  EbraheemMuhammad,et al.  Distributed representations of tuples for entity resolution , 2018, VLDB 2018.