An effective weighted rule-based method for entity resolution

Entity resolution is an important data-cleaning task that detects records belonging to the same entity. It is critical in digital libraries, where different entities can share the same name and no identifier key is available. Conventional methods use similarity measures and clustering techniques to group the records of a specific entity. Because these methods often perform poorly, recent approaches instead build rules from attribute values that are distinctive for each entity. However, these approaches rely on too few attributes and ignore common and empty attribute values, which degrades the quality of entity resolution. In this paper, we define a multi-attribute weighted rule system (MAWR) that examines all values of the records' attributes in order to represent the difficult record-entity mapping. We then propose a rule generation algorithm based on this system, and an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify entities. We evaluate our method on real data, and the experimental results demonstrate its effectiveness and efficiency.
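To make the idea of weighted rules over attribute values concrete, the following is a minimal sketch and not the MAWR or MAWR-ER algorithms themselves, which the abstract does not specify. It assumes a rule is an (attribute, value) pair with a per-entity weight, that the weight is the fraction of labeled records with that value belonging to the entity, and that empty values contribute no evidence. The function names `generate_rules` and `resolve`, the `threshold` parameter, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def generate_rules(labeled_records):
    """Build {(attribute, value): {entity: weight}} from (entity, record) pairs.

    Assumed weighting: the share of labeled records carrying this value
    that belong to the entity. A value unique to one entity gets weight 1.0;
    a value shared by several entities gets a smaller weight for each.
    """
    counts = defaultdict(Counter)
    for entity, record in labeled_records:
        for attr, value in record.items():
            if value:  # empty values yield no rule in this sketch
                counts[(attr, value)][entity] += 1
    rules = {}
    for key, by_entity in counts.items():
        total = sum(by_entity.values())
        rules[key] = {e: n / total for e, n in by_entity.items()}
    return rules


def resolve(record, rules, threshold=0.5):
    """Return the entity with the highest summed rule weight, or None."""
    scores = Counter()
    for attr, value in record.items():
        for entity, weight in rules.get((attr, value), {}).items():
            scores[entity] += weight
    if not scores:
        return None
    entity, score = scores.most_common(1)[0]
    return entity if score >= threshold else None


# Hypothetical toy data: two distinct authors sharing one name.
training = [
    ("e1", {"name": "J. Smith", "coauthor": "A. Lee", "venue": "ICDE"}),
    ("e1", {"name": "J. Smith", "coauthor": "A. Lee", "venue": "VLDB"}),
    ("e2", {"name": "J. Smith", "coauthor": "B. Kim", "venue": "ICDE"}),
]
rules = generate_rules(training)
query = {"name": "J. Smith", "coauthor": "B. Kim", "venue": ""}
print(resolve(query, rules))
```

In this toy setup the shared name alone favors `e1`, but the distinctive co-author value outweighs it, so the query record is assigned to `e2`. The empty venue is simply skipped.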
