Establishing Identity Equivalence in Multi-Relational Domains

Identity equivalence, or alias detection, is an important problem in intelligence analysis: terrorists often use multiple identities to avoid detection. We apply machine learning to the task of determining whether two identities are equivalent. Two challenges arise in this domain. First, the data can be spread across multiple tables. Second, the number of false positives must be kept low. We present a two-step approach that addresses both issues. In the first step, we use Inductive Logic Programming (ILP) to learn a set of rules that are predictive of aliases. In the second step, we treat each learned rule as a random variable in a Bayesian network and use the network to assign a probability that two identities are aliases. We evaluate our technique on several data sets and find that layering a Bayesian network over the rules significantly increases the precision of our system.
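
To make the second step concrete, here is a minimal sketch of how the outputs of learned rules could be combined into a single alias probability. It assumes the simplest possible network structure, a naive Bayes model in which each rule is a binary variable that depends only on the alias/non-alias class; the abstract does not state which structure was actually used. The three rule columns, the toy training pairs, and the function names `train_naive_bayes` and `alias_probability` are hypothetical and serve only as illustration.

```python
from math import exp, log


def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate naive Bayes parameters from binary rule outputs.

    rows   -- list of tuples; rows[k][i] is 1 if rule i fired on pair k
    labels -- list of 0/1; 1 means the pair is a true alias
    alpha  -- Laplace smoothing constant (an assumption of this sketch)
    """
    n_rules = len(rows[0])
    counts = {0: 0, 1: 0}
    fired = {0: [0] * n_rules, 1: [0] * n_rules}
    for row, y in zip(rows, labels):
        counts[y] += 1
        for i, v in enumerate(row):
            fired[y][i] += v
    total = len(labels)
    # Smoothed class prior and per-rule firing probabilities.
    prior = {y: (counts[y] + alpha) / (total + 2 * alpha) for y in (0, 1)}
    cond = {
        y: [(fired[y][i] + alpha) / (counts[y] + 2 * alpha)
            for i in range(n_rules)]
        for y in (0, 1)
    }
    return prior, cond


def alias_probability(model, row):
    """Return P(alias | rule outputs) under the naive Bayes assumption."""
    prior, cond = model
    log_score = {}
    for y in (0, 1):
        s = log(prior[y])
        for i, v in enumerate(row):
            p = cond[y][i]
            s += log(p if v else 1.0 - p)
        log_score[y] = s
    # Normalise in log space for numerical stability.
    m = max(log_score.values())
    num = exp(log_score[1] - m)
    return num / (num + exp(log_score[0] - m))


# Hypothetical toy data: each column is the 0/1 output of one learned rule
# on a candidate pair of identities.
rows = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0),
        (1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

model = train_naive_bayes(rows, labels)
p_all_fire = alias_probability(model, (1, 1, 1))
p_one_fires = alias_probability(model, (1, 0, 0))
p_none_fire = alias_probability(model, (0, 0, 0))
```

On this toy data, a pair on which only one rule fires gets a lower probability than a pair on which all three fire. Thresholding the probability, rather than accepting any pair covered by some rule, is one way such a layer can trade recall for precision.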
