Collective Entity Resolution in Familial Networks

Entity resolution in settings with rich relational structure often introduces complex dependencies between co-references. Exploiting these dependencies is challenging: it requires seamlessly combining statistical, relational, and logical dependencies. One task of particular interest is entity resolution in familial networks. In this setting, multiple partial representations of a family tree are provided, each from the perspective of a different family member, and the challenge is to reconstruct a single family tree from these noisy, partial views. This reconstruction is crucial for applications such as understanding genetic inheritance, tracking disease contagion, and performing census surveys. Here, we design a collective model that incorporates statistical signals (such as name similarity), relational information (such as sibling overlap), and logical constraints (such as transitivity and bijective matching). We show how to integrate these features using probabilistic soft logic, a scalable probabilistic programming framework. In experiments on real-world data, our model significantly outperforms state-of-the-art classifiers that use relational features but are incapable of collective reasoning.
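As a rough illustration of how the three kinds of signal mentioned above can be expressed as soft truth values, the following Python sketch computes a name similarity, a sibling overlap, and the penalty for violating transitivity under the Lukasiewicz relaxation used by probabilistic soft logic. It is not the authors' model: the function names, the choice of `difflib` for string similarity, and the use of Jaccard overlap over sibling names are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Statistical signal: string similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sibling_overlap(siblings_a: set, siblings_b: set) -> float:
    """Relational signal: Jaccard overlap of two sibling-name sets (assumed)."""
    if not siblings_a and not siblings_b:
        return 0.0
    return len(siblings_a & siblings_b) / len(siblings_a | siblings_b)


def lukasiewicz_and(x: float, y: float) -> float:
    """Soft conjunction of two truth values in [0, 1]."""
    return max(0.0, x + y - 1.0)


def distance_to_satisfaction(body: float, head: float) -> float:
    """Hinge penalty for a rule body -> head; zero when head >= body."""
    return max(0.0, body - head)


def transitivity_penalty(same_ab: float, same_bc: float, same_ac: float) -> float:
    """Penalty for the rule Same(A,B) & Same(B,C) -> Same(A,C)."""
    return distance_to_satisfaction(lukasiewicz_and(same_ab, same_bc), same_ac)


if __name__ == "__main__":
    print(name_similarity("Jonathan", "Jonathon"))
    print(sibling_overlap({"ana", "luis"}, {"ana", "luis", "maria"}))
    print(transitivity_penalty(0.9, 0.8, 0.2))
```

In a collective model, penalties of this kind are summed with weights over all candidate pairs and minimized jointly, so that a decision about one pair of references influences the decisions about related pairs.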
