Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People

In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution: it is both massively multi-source and requires multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss the set of requirements that led us to choose the MFIBlocks entity resolution algorithm to achieve the goals of the application. We also present a machine learning approach, based on decision trees, that transforms soft clusters into a ranked clustering of records representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlights the shortcomings of current methods, and proposes avenues for future research in this area.
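
The ranking step described above can be illustrated with a minimal sketch: a decision tree scores candidate clusters of victim records by how likely each cluster is to describe a single person, and the scores induce a ranking. The sketch below assumes scikit-learn and invented similarity features (name similarity, birth-year agreement, place similarity, number of contributing sources); it is not the authors' pipeline, feature set, or learner, only a hedged illustration of the general idea.

```python
# Hypothetical sketch: ranking candidate record clusters with a decision tree.
# Feature names and all values are illustrative, not the Yad Vashem schema.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row summarizes one candidate cluster of records with hand-picked
# similarity features:
# [avg_name_similarity, birth_year_agreement, avg_place_similarity, num_sources]
X_train = np.array([
    [0.95, 1.0, 0.90, 3],   # clusters known to describe one person
    [0.88, 1.0, 0.75, 2],
    [0.40, 0.0, 0.30, 2],   # clusters known to mix different people
    [0.55, 0.0, 0.20, 4],
])
y_train = np.array([1, 1, 0, 0])  # 1 = correct cluster, 0 = incorrect

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Score unseen candidate clusters and rank them by the tree's confidence
# that each cluster corresponds to a single real person.
X_new = np.array([
    [0.92, 1.0, 0.80, 2],
    [0.50, 0.0, 0.60, 3],
])
scores = clf.predict_proba(X_new)[:, 1]
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"rank {rank}: cluster {idx}, score {scores[idx]:.2f}")
```

In a real setting the features would presumably be derived from the blocking output (for example, statistics over the soft clusters produced by MFIBlocks) and the learned ranking would be evaluated against manually labeled clusters; those details are beyond this sketch.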
