Cleansing data of synonym and homonym errors is an important task in fields where high data quality is crucial, for example in disease registries and medical research networks. Record linkage provides methods for minimizing synonym and homonym errors, thereby improving data quality. We focus on homonym errors (denoted 'false matches' in the following), in which records belonging to different entities are wrongly classified as belonging to the same entity. Synonym errors ('false non-matches') occur when a single entity maps to multiple records in the linkage result. They are not considered in this study because in our application domain they are less critical than false matches. False match rates are frequently determined manually through clerical review, that is, without modelling the distribution of the false match rates a priori. An exception is the work of Belin and Rubin (1995) [6], who propose estimating the false match rate by means of a normal mixture model that needs training data for a calibration process. In this paper we present a new approach for estimating the false match rate within the framework of Fellegi and Sunter by methods of Extreme Value Theory (EVT). This approach needs no training data for determining the threshold for matches and therefore leads to a significant cost reduction. After giving two different definitions of the false match rate, we present the EVT tools used in this paper: the generalized Pareto distribution and the mean excess plot. Our experiments with real data show that the model works well, with only slightly lower accuracy than a procedure that has information about the match status and maximizes the accuracy.
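As a rough illustration of the mean excess plot named above, the following Python sketch computes the empirical mean excess function e(u), the average amount by which observations above a threshold u exceed it. For a generalized Pareto distribution with shape ξ < 1 and scale σ, e(u) = (σ + ξu)/(1 − ξ) is linear in u, so an approximately linear stretch of the plot suggests where a generalized Pareto tail model may be adequate. The function names and the sample weights are hypothetical, and how the paper derives the match threshold from the plot is not specified here.

```python
from typing import List, Sequence, Tuple


def mean_excess(weights: Sequence[float], u: float) -> float:
    """Empirical mean excess e(u): average of (w - u) over all w > u."""
    excesses = [w - u for w in weights if w > u]
    if not excesses:
        raise ValueError("no observations exceed the threshold u")
    return sum(excesses) / len(excesses)


def mean_excess_points(weights: Sequence[float]) -> List[Tuple[float, float]]:
    """Points (u, e(u)) of the mean excess plot.

    Thresholds are the distinct observed values except the largest one,
    for which no excess exists.
    """
    thresholds = sorted(set(weights))[:-1]
    return [(u, mean_excess(weights, u)) for u in thresholds]


# Hypothetical matching weights, for illustration only (not data from the paper).
weights = [1.0, 2.0, 3.0, 4.0, 10.0]
points = mean_excess_points(weights)
for u, e in points:
    print(f"u = {u:4.1f}   e(u) = {e:.3f}")
```

Plotting these points and looking for the threshold above which they are roughly linear is the usual way a mean excess plot is read; automating that choice would need additional assumptions.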
[1] Paul Embrechts, et al. Modelling of extremal events in insurance and finance. Math. Methods Oper. Res., 1994.
[2] Michael Thomas, et al. Statistical Analysis of Extreme Values. 2008.
[3] Enrique Castillo. Extreme value theory in engineering. 1988.
[4] Howard B. Newcombe, et al. Handbook of record linkage: methods for health and statistical studies, administration, and business. 1988.
[5] Scott L. DuVall, et al. Extending the Fellegi-Sunter probabilistic record linkage method for approximate field comparators. J. Biomed. Informatics, 2010.
[6] D. Rubin, et al. A method for calibrating false-match rates in record linkage. 1995.
[7] Murat Sariyar, et al. Evaluation of Record Linkage Methods for Iterative Insertions. Methods of Information in Medicine, 2009.
[8] Ivan P. Fellegi, et al. A Theory for Record Linkage. 1969.
[9] Panagiotis G. Ipeirotis, et al. Duplicate Record Detection: A Survey. 2007.