Scalable Matching and Clustering of Entities with FAMER

Entity resolution identifies semantically equivalent entities, e.g., records describing the same product or customer. It is especially challenging for Big Data applications, where large volumes of data from many sources have to be matched and integrated. We therefore introduce FAMER (FAst Multi-source Entity Resolution system), a scalable entity resolution framework that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources into clusters. In addition to previously known clustering schemes, FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes on different real-life and synthetically generated datasets. The evaluation considers both match quality and scalability for different numbers of machines and data sizes.
