MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety.

[1]  Vasilis Efthymiou,et al.  Simplifying Entity Resolution on Web Data with Schema-Agnostic, Non-Iterative Matching , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[2]  C. Lee Giles,et al.  Adaptive sorted neighborhood methods for efficient record linkage , 2007, JCDL '07.

[3]  Alexandra Meliou,et al.  Data X-Ray: A Diagnostic Tool for Data Errors , 2015, SIGMOD Conference.

[4]  Andrew Borthwick,et al.  Dynamic Record Blocking: Efficient Linking of Massive Databases in MapReduce , 2012 .

[5]  Christoph Lange,et al.  Evaluating the quality of the LOD cloud: An empirical investigation , 2018, Semantic Web.

[6]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[7]  Bin Ma,et al.  On the similarity metric and the distance metric , 2009, Theor. Comput. Sci..

[8]  Peter Christen,et al.  Data Matching , 2012, Data-Centric Systems and Applications.

[9]  Vasilis Efthymiou,et al.  Entity resolution in the web of data , 2013, Entity Resolution in the Web of Data.

[10]  Divesh Srivastava,et al.  Big data integration , 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[11]  Wolfgang Nejdl,et al.  Meta-Blocking: Taking Entity Resolutionto the Next Level , 2014, IEEE Transactions on Knowledge and Data Engineering.

[12]  Vasilis Efthymiou,et al.  Big data entity resolution: From highly to somehow similar entity descriptions in the Web , 2015, 2015 IEEE International Conference on Big Data (Big Data).

[13]  Gjergji Kasneci,et al.  SIGMa: simple greedy matching for aligning large knowledge bases , 2012, KDD.

[14]  Serge Abiteboul,et al.  PARIS: Probabilistic Alignment of Relations, Instances, and Schema , 2011, Proc. VLDB Endow..

[15]  Dmitri V. Kalashnikov,et al.  ProgressER: Adaptive Progressive Approach to Relational Entity Resolution , 2018, ACM Trans. Knowl. Discov. Data.

[16]  Gerhard Weikum,et al.  LINDA: distributed web-of-data-scale entity matching , 2012, CIKM.

[17]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[18]  Andreas Thor,et al.  Dedoop: Efficient Deduplication with Hadoop , 2012, Proc. VLDB Endow..

[19]  Yi Li,et al.  RiMOM: A Dynamic Multistrategy Ontology Alignment Framework , 2009, IEEE Transactions on Knowledge and Data Engineering.

[20]  Christian Bizer,et al.  WInte.r - A Web Data Integration Framework , 2017, International Semantic Web Conference.

[21]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[22]  Felix Naumann,et al.  A generalization of blocking and windowing algorithms for duplicate detection , 2011, 2011 International Conference on Data and Knowledge Engineering (ICDKE).

[23]  Raymond J. Mooney,et al.  Adaptive Blocking: Learning to Scale Up Record Linkage , 2006, Sixth International Conference on Data Mining (ICDM'06).

[24]  Lise Getoor,et al.  Collective entity resolution in relational data , 2007, TKDD.

[25]  Jeff Heflin,et al.  Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach , 2011, SEMWEB.

[26]  George Papastefanatos,et al.  Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking , 2016, Big Data Res..

[27]  Alon Y. Halevy,et al.  Data Integration: After the Teenage Years , 2017, PODS.

[28]  Claudia Niederée,et al.  A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces , 2013, IEEE Transactions on Knowledge and Data Engineering.

[29]  Dmitri V. Kalashnikov,et al.  Progressive Approach to Relational Entity Resolution , 2014, Proc. VLDB Endow..

[30]  George Papastefanatos,et al.  Parallel meta-blocking for scaling entity resolution over big heterogeneous data , 2017, Inf. Syst..

[31]  Avigdor Gal,et al.  MFIBlocks: An effective blocking algorithm for entity resolution , 2013, Inf. Syst..

[32]  Nilesh N. Dalvi,et al.  Large-Scale Collective Entity Matching , 2011, Proc. VLDB Endow..

[33]  Gautam Shroff,et al.  Graph-Parallel Entity Resolution using LSH & IMM , 2014, EDBT/ICDT Workshops.

[34]  Avigdor Gal,et al.  Comparative Analysis of Approximate Blocking Techniques for Entity Resolution , 2016, Proc. VLDB Endow..

[35]  Juan-Zi Li,et al.  RiMOM-IM: A Novel Iterative Framework for Instance Matching , 2016, Journal of Computer Science and Technology.

[36]  Mayank Kejriwal,et al.  Disjunctive Normal Form Schemes for Heterogeneous Attributed Graphs , 2016, ArXiv.