Unsupervised deduplication using cross-field dependencies

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent -- because venues tend to focus on a few research areas -- but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance -- a 58% reduction in error over a standard Dirichlet process mixture.

[1]  Stuart J. Russell,et al.  Probabilistic models with unknown objects , 2006 .

[2]  Pedro M. Domingos Multi-Relational Record Linkage , 2003 .

[3]  Andrew McCallum,et al.  Conditional Models of Identity Uncertainty with Application to Noun Coreference , 2004, NIPS.

[4]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[5]  Nando de Freitas,et al.  Nonparametric Bayesian Logic , 2005, UAI.

[6]  Dan Klein,et al.  Unsupervised Coreference Resolution in a Nonparametric Bayesian Model , 2007, ACL.

[7]  Breck Baldwin,et al.  Algorithms for Scoring Coreference Chains , 1998 .

[8]  Lise Getoor,et al.  Collective entity resolution in relational data , 2007, TKDD.

[9]  David B. Dahl,et al.  Sequentially-Allocated Merge-Split Sampler for Conjugate and Nonconjugate Dirichlet Process Mixture Models , 2005 .

[10]  Andrew McCallum,et al.  Joint deduplication of multiple record types in relational data , 2005, CIKM '05.

[11]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[12]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[13]  Pedro M. Domingos,et al.  Entity Resolution with Markov Logic , 2006, Sixth International Conference on Data Mining (ICDM'06).

[14]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[15]  Yee Whye Teh,et al.  A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes , 2006, ACL.

[16]  Andrew McCallum,et al.  An Integrated, Conditional Model of Information Extraction and Coreference with Appli , 2004, UAI.

[17]  Padhraic Smyth,et al.  Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model , 2006, NIPS.

[18]  Radford M. Neal Markov Chain Sampling Methods for Dirichlet Process Mixture Models , 2000 .