Multi-source uncertain entity resolution: Transforming holocaust victim reports into people

In this work we present a multi-source uncertain entity resolution model and show its implementation in a use case of Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of being both massively multi-source and by requiring multi-level entity resolution. With today's abundance of information sources, this project motivates the use of multi-source resolution on a big-data scale. We instantiate the proposed model using the MFIBlocks entity resolution algorithm and a machine learning approach, based upon decision trees to transform soft clusters into ranked clustering of records, representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset that make it a good candidate for multi-source entity resolution. We conclude with proposing avenues for future research in this realm. HighlightsUncertain Entity Resolution allows creating multiple narratives from complementary sources of data.The approach was demonstrated during a unique project performed on the Yad Vashem Names database.Algorithms implementing the approach were empirically evaluated on a tagged subset on various configurations and versus equivalent algorithms.The accurate and insightful results are being integrated into Yad Vashem systems and user applications.

[1]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[2]  Qiang Yang,et al.  A Machine Learning Approach for Instance Matching Based on Similarity Metrics , 2012, SEMWEB.

[3]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[4]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[5]  Renée J. Miller,et al.  Clean Answers over Dirty Databases: A Probabilistic Approach , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[6]  William W. Cohen Data integration using similarity joins and a word-based information representation language , 2000, TOIS.

[7]  William E. Winkler,et al.  Methods for Record Linkage and Bayesian Networks , 2002 .

[8]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[9]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[10]  W. Winkler Overview of Record Linkage and Current Research Directions , 2006 .

[11]  Peter Christen,et al.  Febrl: a freely available record linkage system with a graphical user interface , 2008 .

[12]  Avigdor Gal Tutorial: Uncertain Entity Resolution , 2014, Proc. VLDB Endow..

[13]  Tobias Blanke,et al.  Integrating Holocaust Research , 2013 .

[14]  Yoav Freund,et al.  The Alternating Decision Tree Learning Algorithm , 1999, ICML.

[15]  Keizo Oyama,et al.  A Fast Linkage Detection Scheme for Multi-Source Information Integration , 2005, International Workshop on Challenges in Web Information Retrieval and Integration.

[16]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[17]  Peter Christen,et al.  Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface , 2008, KDD.

[18]  Avigdor Gal Uncertain entity resolution: re-evaluating entity resolution in the big data era: tutorial , 2014, VLDB 2014.

[19]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[20]  Claudia Niederée,et al.  A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces , 2013, IEEE Transactions on Knowledge and Data Engineering.

[21]  Luis Gravano,et al.  Approximate String Joins in a Database (Almost) for Free , 2001, VLDB.

[22]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[23]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[24]  Felix Naumann,et al.  An Introduction to Duplicate Detection , 2010, An Introduction to Duplicate Detection.

[25]  Avigdor Gal,et al.  Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People , 2016, SIGMOD Conference.

[26]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[27]  Claudia Niederée,et al.  On-the-fly entity-aware query processing in the presence of linkage , 2010, Proc. VLDB Endow..

[28]  Christian Borgelt,et al.  An implementation of the FP-growth algorithm , 2005 .

[29]  Avigdor Gal,et al.  MFIBlocks: An effective blocking algorithm for entity resolution , 2013, Inf. Syst..

[30]  Avigdor Gal,et al.  Comparative Analysis of Approximate Blocking Techniques for Entity Resolution , 2016, Proc. VLDB Endow..

[31]  Yongtao Ma,et al.  TYPiMatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration , 2013, WSDM.

[32]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[33]  Robert L. Surowka Modeling and querying possible repairs in duplicate detection , 2010 .

[34]  Rajeev Motwani,et al.  Robust identification of fuzzy duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).