Object Identification with Constraints

Object identification is the task of recognizing different representations of the same object from noisy attributes, for example descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for this task, almost all of them based on similarity functions over pairs of objects. Although these similarity functions are now typically learned from labeled training data, the structural information given by the labeled data is not used. We formulate a generic model for object identification and show how almost any previously proposed identification model can be extended to satisfy structural constraints. Building on this, we propose a model that uses structural information, given as pairwise constraints, to guide collective decisions about object identification in addition to a learned similarity measure. Empirical experiments on public and real-life data show that combining structural information with attribute-based similarity substantially improves overall performance on object identification tasks.
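To make the idea of combining a similarity measure with pairwise constraints concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes a token-set Jaccard similarity as a stand-in for a learned similarity function, a fixed merge threshold, and a greedy single-link merging strategy; the names `jaccard`, `identify`, `must_link`, `cannot_link`, and `threshold` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; an assumed stand-in for a learned similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def identify(records, must_link, cannot_link, threshold=0.5, sim=jaccard):
    """Greedily merge records into groups without violating cannot-link pairs.

    must_link and cannot_link are lists of (index, index) pairs.
    The sketch assumes the two constraint sets do not contradict each other.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def violates(ra, rb):
        # Merging groups ra and rb is forbidden if a cannot-link pair spans them.
        return any({find(x), find(y)} == {ra, rb} for x, y in cannot_link)

    # Must-link pairs are merged first, regardless of similarity.
    for x, y in must_link:
        parent[find(x)] = find(y)

    # Remaining pairs are visited from most to least similar.
    pairs = sorted(combinations(range(len(records)), 2),
                   key=lambda p: sim(records[p[0]], records[p[1]]),
                   reverse=True)
    for x, y in pairs:
        if sim(records[x], records[y]) < threshold:
            break  # all later pairs are less similar still
        rx, ry = find(x), find(y)
        if rx != ry and not violates(rx, ry):
            parent[rx] = ry

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With four product descriptions in which records 0 and 2 differ only in the model number, calling `identify(records, [], [])` groups them together on similarity alone, whereas supplying a cannot-link pair `(0, 2)` keeps them apart and a must-link pair `(0, 3)` joins two descriptions that share too few tokens to pass the threshold. The constraints therefore override the attribute-based similarity exactly where it is misleading, which is the kind of effect the abstract describes.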
