Towards a Record Linkage Layer to Support Big Data Integration

Record linkage is a crucial step in big data integration (BDI). It is also one of BDI's major challenges, because a growing number of structured data sources need to be linked even though they do not share common attributes. Our research-in-progress aims to develop a record linkage layer that assists data scientists in integrating a variety of data sources. A structured literature review of 68 papers reveals (1) key data sets, (2) available classification algorithms (match or no match), and (3) similarity measures to consider in BDI projects. The results highlight the foundational requirements for developing the record linkage layer, such as processing unstructured attributes. As BDI emerges as a priority for industry, our work proposes a record linkage layer that provides similarity measures and integration algorithms and assists in their selection. A record linkage layer can contribute to big data adoption in industry settings and improve the quality of big data integration processes so that they effectively support business decision-making.
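
To illustrate how a similarity measure and a match/no-match classifier fit together, the following is a minimal sketch and not the proposed layer itself. It assumes token-based Jaccard similarity and a fixed decision threshold; the function names, the example records, and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
def jaccard(a: str, b: str) -> float:
    """Token-based Jaccard similarity between two attribute values."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty values are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def classify(rec_a: dict, rec_b: dict, fields: list, threshold: float = 0.5) -> str:
    """Average the per-field similarities and apply a threshold (assumed rule)."""
    scores = [jaccard(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]
    mean_score = sum(scores) / len(scores)
    return "match" if mean_score >= threshold else "no match"


# Hypothetical records from two sources that describe companies.
source_a = {"name": "Acme Corp", "city": "Berlin"}
source_b = {"name": "ACME Corp GmbH", "city": "Berlin"}
print(classify(source_a, source_b, ["name", "city"]))
```

In practice the choice of similarity measure and classifier depends on the data sets involved, which is the selection problem the record linkage layer is meant to support.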
