Duplicate detection, also known as entity matching or record linkage, was first defined by Newcombe et al. [19] and has been a research topic for several decades. The challenge is to effectively and efficiently identify pairs of records that represent the same real-world entity. Researchers have developed a variety of methods to measure the similarity of records and to reduce the number of required comparisons. Comparing these methods to each other is essential to assess their quality and efficiency. However, results are still difficult to compare, as evaluations usually differ in the datasets, the similarity measures, the implementation of the algorithms, or simply the hardware on which the code is executed. To address this challenge, we are developing the comprehensive duplicate detection toolkit “DuDe”. DuDe already provides multiple methods and datasets for duplicate detection and consists of several components with clear interfaces that can easily be extended with individual code. In this paper, we present the DuDe architecture and its workflow for duplicate detection. We show that DuDe makes it easy to compare different algorithms and similarity measures, which is an important step towards a duplicate detection benchmark.

1. DUPLICATE DETECTION FRAMEWORKS

The basic problem of duplicate detection has been studied under various names, such as entity matching, record linkage, merge/purge, or record reconciliation. Given a set of entities, the goal is to identify the represented set of distinct real-world entities. Proposed algorithms in the area of duplicate detection aim to improve either the efficiency or the effectiveness of the duplicate detection process. Efficiency is usually improved by reducing the number of pairwise comparisons, which in a naive approach is quadratic in the number of records. By making intelligent guesses as to which records have a high probability of representing the same real-world entity, the search space is reduced, with the drawback that some duplicates might be missed. Effectiveness, on the other hand, aims at classifying pairs of records accurately as duplicate or non-duplicate [17].

Elmagarmid et al. have compiled a survey of existing algorithms and techniques for duplicate detection [11]. Köpcke and Rahm give a comprehensive overview of existing duplicate detection frameworks [15]. They compare eleven frameworks and distinguish between frameworks without training (BN [16], MOMA [24], SERF [1]), training-based frameworks (Active Atlas [22], [23], MARLIN [2, 3], Multiple Classifier System [27], Operator Trees [4]), and hybrid frameworks (TAILOR [10], FEBRL [6], STEM [14], Context Based Framework [5]). Not included in the overview is STRINGER [12], which deals with approximate string matching in large data sources.
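To make the efficiency trade-off concrete, the following minimal Java sketch contrasts naive quadratic candidate-pair generation with the Sorted Neighborhood method known from the merge/purge literature. It illustrates the general technique only and is not DuDe's actual implementation; the Person record and the choice of the name attribute as sorting key are assumptions made for the example.

```java
import java.util.*;

/** Minimal sketch of search-space reduction via the Sorted Neighborhood
 *  method. Illustrative only -- not DuDe's actual implementation. */
public class SortedNeighborhoodSketch {

    record Person(String id, String name) {}  // hypothetical record type

    // Naive approach: generate all pairs, quadratic in the number of records.
    static List<Person[]> naivePairs(List<Person> records) {
        List<Person[]> pairs = new ArrayList<>();
        for (int i = 0; i < records.size(); i++)
            for (int j = i + 1; j < records.size(); j++)
                pairs.add(new Person[] { records.get(i), records.get(j) });
        return pairs;
    }

    // Sorted Neighborhood: sort by a key and compare only records that fall
    // within a sliding window of size w, yielding O(n*w) candidate pairs.
    static List<Person[]> sortedNeighborhoodPairs(List<Person> records, int w) {
        List<Person> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing(Person::name));  // sorting key is an assumption
        List<Person[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++)
            for (int j = i + 1; j < Math.min(i + w, sorted.size()); j++)
                pairs.add(new Person[] { sorted.get(i), sorted.get(j) });
        return pairs;
    }

    public static void main(String[] args) {
        List<Person> records = List.of(
            new Person("1", "John Smith"), new Person("2", "Jon Smith"),
            new Person("3", "Jane Doe"), new Person("4", "J. Doe"));
        System.out.println("naive: " + naivePairs(records).size());  // 6 pairs
        System.out.println("window (w=2): "
            + sortedNeighborhoodPairs(records, 2).size());           // 3 pairs
    }
}
```

With a window of size w, each record is compared only with its w - 1 successors in sort order, so duplicates whose sorting keys end up far apart are missed; this is exactly the efficiency/effectiveness trade-off described above.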
Köpcke and Rahm use several comparison criteria, such as supported entity types (e.g., relational entities, XML), availability of partitioning methods to reduce the search space, the matchers used to determine whether two entities are similar enough to represent the same real-world entity, the ability to combine several matchers, and, where necessary, the selection of training data. In their summary, Köpcke and Rahm criticize that the frameworks use diverse methodologies, measures, and test problems for evaluation, and that it is therefore difficult to assess the efficiency and effectiveness of each single system. They argue that standardized entity matching benchmarks are needed and that researchers should provide prototype implementations and test data with their algorithms. This agrees with Neiling et al. [18], who discuss desired properties of a test framework for object identification solutions. Moreover, Weis et al. [25] argue for a duplicate detection benchmark. Both papers see the necessity for standardized real-world or artificial datasets, which must also contain information about the true duplicate pairs. Additionally, clearly defined quality criteria with a description of their computation and a detailed specification of the test procedure are required. An overview of quality and complexity measures for data linkage and deduplication can be found in Christen and Goiser [7].

With DuDe, we provide a toolkit for duplicate detection that can easily be extended by new algorithms and components. Conducted experiments are reproducible and can be compared with former ones. Additionally, several algorithms, similarity measures, and datasets with gold standards are provided, which is a requirement for a duplicate detection benchmark. DuDe and several datasets are available for download at http://www.hpi.uni-potsdam.de/naumann/projekte/dude.html.
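As an illustration of clearly defined quality criteria with a specified computation, the following Java sketch derives precision, recall, and F-measure by comparing a set of detected duplicate pairs against a gold standard. The order-independent pair encoding and all method names are assumptions made for the example and do not reflect DuDe's actual API.

```java
import java.util.*;

/** Sketch: standard quality measures for duplicate detection, computed
 *  against a gold standard of true duplicate pairs. Illustrative only. */
public class QualityMeasuresSketch {

    // Encode a record pair order-independently, e.g. ids "2" and "1" -> "1|2".
    static String pair(String id1, String id2) {
        return id1.compareTo(id2) < 0 ? id1 + "|" + id2 : id2 + "|" + id1;
    }

    public static void main(String[] args) {
        Set<String> goldStandard = Set.of(pair("1", "2"), pair("3", "4"), pair("5", "6"));
        Set<String> detected     = Set.of(pair("1", "2"), pair("3", "4"), pair("7", "8"));

        Set<String> truePositives = new HashSet<>(detected);
        truePositives.retainAll(goldStandard);  // detected pairs that are true duplicates

        double precision = (double) truePositives.size() / detected.size();
        double recall    = (double) truePositives.size() / goldStandard.size();
        double fMeasure  = 2 * precision * recall / (precision + recall);

        System.out.printf("precision=%.2f recall=%.2f f-measure=%.2f%n",
                          precision, recall, fMeasure);  // 0.67 0.67 0.67
    }
}
```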
REFERENCES

[1] Felix Naumann et al. Industry-scale duplicate detection. Proc. VLDB Endow., 2008.
[2] Mikhail Bilenko and Raymond J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection. 2003.
[3] Surajit Chaudhuri et al. Example-driven design of efficient record matching queries. VLDB, 2007.
[4] Peter Christen et al. Quality and Complexity Measures for Data Linkage and Deduplication. Quality Measures in Data Mining, 2007.
[5] Felix Naumann et al. A Duplicate Detection Benchmark for XML (and Relational) Data. 2006.
[6] Renée J. Miller et al. Framework for Evaluating Clustering Algorithms in Duplicate Detection. Proc. VLDB Endow., 2009.
[7] Felix Naumann et al. A Comparison and Generalization of Blocking and Windowing Algorithms for Duplicate Detection. 2009.
[8] Pedro M. Domingos et al. Object Identification with Attribute-Mediated Dependences. PKDD, 2005.
[9] Dmitri V. Kalashnikov et al. Exploiting context analysis for combining multiple entity resolution systems. SIGMOD Conference, 2009.
[10] Ahmed K. Elmagarmid et al. TAILOR: a record linkage toolbox. Proceedings 18th International Conference on Data Engineering, 2002.
[11] Sudha Ram et al. Entity identification for heterogeneous database integration--a multiple classifier system approach and empirical evaluation. Inf. Syst., 2005.
[12] Raymond J. Mooney et al. Adaptive duplicate detection using learnable string similarity measures. KDD '03, 2003.
[13] Peter Christen et al. Febrl: a freely available record linkage system with a graphical user interface. 2008.
[14] Salvatore J. Stolfo et al. The merge/purge problem for large databases. SIGMOD '95, 1995.
[15] Erhard Rahm et al. Frameworks for entity matching: A comparison. Data Knowl. Eng., 2010.
[16] Andreas Thor et al. MOMA - A Mapping-based Object Matching System. CIDR, 2007.
[17] Peter Christen et al. Accurate Synthetic Generation of Realistic Personal Information. PAKDD, 2009.
[18] Pável Calado et al. Structure-based inference of XML similarity for fuzzy duplicate detection. CIKM '07, 2007.
[19] Craig A. Knoblock et al. Learning domain-independent string transformation weights for high accuracy object identification. KDD, 2002.
[20] Jennifer Widom et al. Swoosh: a generic approach to entity resolution. The VLDB Journal, 2008.
[21] H. B. Newcombe et al. Automatic linkage of vital records. Science, 1959.
[22] Felix Naumann et al. An Introduction to Duplicate Detection. 2010.
[23] Erhard Rahm et al. Training selection for tuning entity matching. QDB/MUD, 2008.
[24] Felix Naumann et al. Object Identification Quality. 2003.
[25] Craig A. Knoblock et al. Learning object identification rules for information integration. Inf. Syst., 2001.
[26] Jayant Madhavan et al. Reference reconciliation in complex information spaces. SIGMOD '05, 2005.
[27] Ahmed K. Elmagarmid et al. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 2007.