Incremental entity resolution process over query results for data integration systems

Entity Resolution (ER) in data integration systems is the problem of identifying groups of tuples, from one or multiple data sources, that represent the same real-world entity. It is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes even more challenging when data sources are dynamic or when a large volume of data must be integrated. To deal with large volumes of data, new ER solutions have been proposed. One possible approach is to perform the ER process over query results rather than over the whole set of tuples being integrated. Additionally, the results of previous ER tasks can be reused to reduce the number of comparisons between pairs of tuples at query time. Similarly, indexing techniques can be employed to help identify equivalent tuples and to further reduce the number of pairwise comparisons. In this context, this work proposes an incremental ER process over query results. Its contributions are the specification, implementation, and evaluation of the proposed incremental process. Our experiments indicate that incremental ER at query time is more efficient than traditional ER processes.
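To make the idea concrete, the following is a minimal sketch of incremental ER restricted to query results. It is not the implementation evaluated in this work. The class name `IncrementalResolver`, the two-character prefix blocking key, the string-similarity measure, and the 0.85 threshold are all illustrative assumptions. The sketch shows two mechanisms: cluster assignments from earlier queries are reused, and an index limits the comparisons made for tuples not seen before.

```python
from difflib import SequenceMatcher


class IncrementalResolver:
    """Toy incremental ER over query results (illustrative assumptions only)."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed similarity cut-off
        self.cluster_of = {}        # tuple id -> cluster id, reused across queries
        self.members = {}           # cluster id -> names already placed in it
        self.index = {}             # blocking key -> set of cluster ids
        self.comparisons = 0        # pairwise comparisons performed so far

    @staticmethod
    def blocking_key(name):
        # Hypothetical blocking key: first two characters, lowercased.
        return name.strip().lower()[:2]

    def similar(self, a, b):
        self.comparisons += 1
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        return ratio >= self.threshold

    def resolve(self, query_result):
        """Group the (tuple_id, name) pairs of one query result by entity."""
        groups = {}
        for tid, name in query_result:
            cid = self.cluster_of.get(tid)  # reuse a previous ER result
            if cid is None:
                cid = self._assign(tid, name)
            groups.setdefault(cid, []).append(tid)
        return list(groups.values())

    def _assign(self, tid, name):
        # Only clusters sharing the blocking key are candidate matches.
        candidates = self.index.setdefault(self.blocking_key(name), set())
        for cid in sorted(candidates):
            if any(self.similar(name, other) for other in self.members[cid]):
                break  # matching cluster found
        else:
            cid = len(self.members)  # no match: open a new cluster
            self.members[cid] = []
            candidates.add(cid)
        self.members[cid].append(name)
        self.cluster_of[tid] = cid
        return cid


resolver = IncrementalResolver()
first = resolver.resolve([(1, "John Smith"), (2, "Jon Smith"), (3, "Mary Jones")])
second = resolver.resolve([(2, "Jon Smith"), (3, "Mary Jones"), (4, "John Smyth")])
print(first, second, resolver.comparisons)
```

In the second query, tuples 2 and 3 are answered from the stored assignments, so only tuple 4 triggers a comparison. A real system would compare several attributes and would have to handle updates and deletions in the sources, which this sketch ignores.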
