Normalization of Duplicate Records from Multiple Sources

Data consolidation is a challenging issue in data integration. The usefulness of data increases when it is linked and fused with other data from numerous (Web) sources. The promise of Big Data hinges upon addressing several big data integration challenges, such as record linkage at scale, real-time data fusion, and integrating the Deep Web. Although much work has been conducted on these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same real-world entity. We refer to this task as record normalization. Such a record representation, which we call the normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value-component) and of normalization forms (e.g., typical versus complete). We propose a comprehensive framework for computing the normalized record. The proposed framework includes a suite of record normalization methods, from naive ones, which use only the information gathered from the records themselves, to complex strategies, which globally mine a group of duplicate records before selecting a value for an attribute of a normalized record. We conducted extensive empirical studies with all the proposed methods. We identify the strengths and weaknesses of each method and recommend the ones to be used in practice.
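
To make the idea of field-level normalization concrete, the following minimal sketch picks, for each field, the most frequent non-empty value across a group of duplicate records. This is only an illustration under stated assumptions, not the method proposed in the paper: the function name `normalize_record`, the frequency-based selection, the tie-breaking rule (longer value first), and the sample records are all hypothetical.

```python
from collections import Counter


def normalize_record(records):
    """Build one record from duplicates by majority vote per field.

    Hypothetical helper: assumes each record is a dict mapping field
    names to string values, and that all records describe one entity.
    """
    # Collect field names in first-seen order so the output is stable.
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)

    normalized = {}
    for name in fields:
        # Ignore missing or blank values when voting.
        values = [rec[name].strip() for rec in records
                  if rec.get(name, "").strip()]
        if not values:
            normalized[name] = ""
            continue
        counts = Counter(values)
        # Assumed tie-break: higher count, then longer value, then
        # lexicographic order, purely to keep the result deterministic.
        normalized[name] = max(counts, key=lambda v: (counts[v], len(v), v))
    return normalized


duplicates = [
    {"title": "Duplicate Record Detection: A Survey",
     "venue": "TKDE", "year": "2007"},
    {"title": "Duplicate Record Detection: A Survey",
     "venue": "IEEE Transactions on Knowledge and Data Engineering",
     "year": "2007"},
    {"title": "Duplicate record detection", "venue": "TKDE", "year": ""},
]
print(normalize_record(duplicates))
```

A naive vote like this uses only the values present in the duplicate group; it may well select an abbreviation such as "TKDE" over the full venue name, which is the kind of limitation that motivates finer-grained (value-component) analysis.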
