Duplicate Records Cleansing with Length Filtering and Dynamic Weighting

Because heterogeneous literature databases differ in data format, omit certain properties, and contain imprecise records, integrating them produces duplicate records, which lower the efficiency of information retrieval. In this paper, we propose an approach to duplicate record cleansing named length filtering and dynamic weighting (LFDW). LFDW has three steps. The first step is length filtering: record pairs whose lengths differ greatly are filtered out before comparison. In the second step, duplicate records are detected using dynamically weighted properties. In particular, because the author name is an important property of a publication and one author's name may be written in different styles, a fuzzy name matching method is adopted to identify the same author across name styles. Finally, to improve the performance of duplicate detection, we adopt a dynamic sliding-window algorithm when comparing records. The results indicate that LFDW outperforms traditional approaches in running time, recall, and precision.
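
The three steps can be illustrated with a minimal Python sketch. It is not the LFDW implementation. The abstract does not give the formulas, so every name, threshold, and weight below is an assumption made for illustration. "Dynamic weighting" is interpreted here as renormalizing the weights over the properties present in both records, the fuzzy name match is reduced to comparing surname and first initial, and the window size is fixed rather than adapted dynamically.

```python
from difflib import SequenceMatcher


def passes_length_filter(a, b, max_diff_ratio=0.3):
    """Keep a pair only if the two record lengths are reasonably close.
    The 0.3 cut-off is an assumed value, not taken from the paper."""
    longer = max(len(a), len(b))
    return longer == 0 or abs(len(a) - len(b)) / longer <= max_diff_ratio


def author_key(name):
    """Reduce a name to (surname, first initial), so that 'J. Smith',
    'John Smith' and 'Smith, John' map to the same key (a simplification)."""
    if "," in name:
        last, _, first = name.partition(",")
    else:
        first, _, last = name.strip().rpartition(" ")
    return last.strip().lower(), first.strip()[:1].lower()


def field_similarity(field, a, b):
    """Similarity in [0, 1]; authors use the name key, other fields use difflib."""
    if field == "author":
        return 1.0 if author_key(a) == author_key(b) else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_similarity(r1, r2, base_weights):
    """Weighted mean over the properties present in BOTH records.
    Weights of missing properties are redistributed (assumed interpretation)."""
    shared = [f for f in base_weights if r1.get(f) and r2.get(f)]
    total = sum(base_weights[f] for f in shared)
    if total == 0:
        return 0.0
    score = sum(base_weights[f] * field_similarity(f, r1[f], r2[f]) for f in shared)
    return score / total


def record_text(record):
    """Concatenate the non-empty property values of a record."""
    return " ".join(str(v) for v in record.values() if v)


def find_duplicates(records, base_weights, window=3, threshold=0.85):
    """Sort by title, then compare each record with its next window-1 neighbours."""
    order = sorted(range(len(records)),
                   key=lambda i: (records[i].get("title") or "").lower())
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            a, b = records[i], records[j]
            if not passes_length_filter(record_text(a), record_text(b)):
                continue  # step 1: skip pairs with very different lengths
            if weighted_similarity(a, b, base_weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data; the third record lacks a year and uses another name style.
records = [
    {"title": "The merge/purge problem for large databases",
     "author": "Smith, John", "year": "1995"},
    {"title": "A completely different paper about graphs",
     "author": "Doe, Jane", "year": "2001"},
    {"title": "The Merge/Purge Problem for Large Databases",
     "author": "J. Smith", "year": None},
]
weights = {"title": 0.5, "author": 0.3, "year": 0.2}
print(find_duplicates(records, weights))  # [(0, 2)]
```

In this sketch the missing year in the third record does not lower its score, because the title and author weights are rescaled to sum to one before the comparison.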
