Scaling multiple-source entity resolution using statistically efficient transfer learning

We consider a serious, previously unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While a rich literature describes almost all aspects of pairwise ER, this challenge arises now because of the unprecedented ability to acquire and store data from online sources, growing interest in ER-driven features such as enriched search verticals, and the distinct noise and missing-data characteristics of each source. We show on real-world and synthetic data that, for state-of-the-art techniques, heterogeneous sources force the amount of labeled training data to scale quadratically in the number of sources just to maintain constant precision and recall. We address this challenge with a new transfer learning algorithm that requires far less training data (or, equivalently, achieves superior accuracy with the same data) and is trained using fast convex optimization. The intuition behind our approach is to adaptively share structure learned about one scoring problem with all other scoring problems that have a data source in common. We demonstrate that our theoretically motivated approach improves upon existing techniques for multi-source ER.
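The quadratic growth in labeling cost follows from simple counting: if a separate similarity scorer is trained for every unordered pair of sources, k sources yield k(k-1)/2 scoring problems. The sketch below illustrates only this arithmetic. The function names and the figure of 500 labels per pair are hypothetical and chosen for illustration; they are not taken from the experiments.

```python
from itertools import combinations


def pairwise_problems(sources):
    """Return every unordered pair of distinct sources.

    Each pair stands for one similarity-scoring problem that would
    need its own labeled training set if trained independently.
    """
    return list(combinations(sources, 2))


def labels_needed(num_sources, labels_per_pair):
    """Total labels when each source pair is trained independently.

    labels_per_pair is an assumed constant (hypothetical), i.e. the
    number of labels one pairwise scorer needs to reach a fixed
    precision/recall.
    """
    return labels_per_pair * num_sources * (num_sources - 1) // 2


if __name__ == "__main__":
    # Doubling the number of sources roughly quadruples the labels.
    for k in (2, 5, 10, 20):
        print(k, len(pairwise_problems(range(k))), labels_needed(k, 500))
```

Under these assumptions, going from 10 to 20 sources raises the number of pairwise problems from 45 to 190, which is the cost that sharing structure across problems with a common source is meant to reduce.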
