Contextual Entity Resolution Approach for Genealogical Data

Due to the large amount of inaccurate information and the different types of ambiguity in available digitized genealogical data, applying Entity Resolution techniques to determine which records refer to the same entity should be considered the first, and still very important, step in the analysis of this type of data. Traditional methods use a standard string similarity measure to calculate the similarity among references, neglecting the contextual information available for each reference, and then introduce the most similar pairs as matches. In this paper, first, we introduce a novel blocking strategy to reduce the number of potential candidate pairs. Second, we propose a contextual similarity measure which considers not only the string similarity among references but also the contextual information available for them. Third, we evaluate our proposed method extensively from different perspectives; among the many patterns discussed, the "early child death" pattern was found to be prominent.
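To make the two ideas above concrete, the following is a minimal Python sketch of blocking followed by a context-aware similarity score. It is not the method proposed in the paper, whose details are not given here. The record layout (`name`, `context`), the blocking key (first two letters of the last name token), the Jaccard overlap used for context, and the weight `alpha` are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Hypothetical blocking key: first two letters of the last name token.
    return record["name"].split()[-1][:2].lower()


def candidate_pairs(records):
    # Group records by blocking key and only pair records within a block,
    # which avoids comparing every record with every other record.
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def string_similarity(a, b):
    # Plain string similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_similarity(c1, c2):
    # Jaccard overlap of the related names (an assumed notion of "context").
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)


def contextual_similarity(r1, r2, alpha=0.5):
    # Weighted mix of name similarity and context overlap; alpha is an
    # illustrative parameter, not a value taken from the paper.
    name_part = string_similarity(r1["name"], r2["name"])
    context_part = context_similarity(r1["context"], r2["context"])
    return alpha * name_part + (1 - alpha) * context_part


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "Jan Jansen", "context": {"Maria Peters", "Piet Jansen"}},
        {"id": 2, "name": "Jan Janssen", "context": {"Maria Peters", "Piet Jansen"}},
        {"id": 3, "name": "Jan Jansen", "context": {"Anna Smit"}},
        {"id": 4, "name": "Willem Bakker", "context": set()},
    ]
    for r1, r2 in candidate_pairs(records):
        print(r1["id"], r2["id"], round(contextual_similarity(r1, r2), 3))
```

In this toy data, records 1 and 3 share an identical name but no related persons, while records 1 and 2 differ slightly in spelling but share the same related persons. A pure string measure would rank the first pair higher; the mixed score ranks the second pair higher, which is the kind of effect a contextual measure is meant to capture.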
