Efficient SPectrAl Neighborhood blocking for entity resolution

In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1)We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.

[1]  Lise Getoor,et al.  Iterative record linkage for cleaning and integration , 2004, DMKD '04.

[2]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[3]  Georgia Koutrika,et al.  Entity resolution with iterative blocking , 2009, SIGMOD Conference.

[4]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[5]  Sudipto Guha,et al.  Merging the Results of Approximate Match Operations , 2004, VLDB.

[6]  Gert Vegter,et al.  In handbook of discrete and computational geometry , 1997 .

[7]  Golub Gene H. Et.Al Matrix Computations, 3rd Edition , 2007 .

[8]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[9]  James Mayfield,et al.  Character N-Gram Tokenization for European Language Text Retrieval , 2004, Information Retrieval.

[10]  Charu C. Aggarwal,et al.  On the effects of dimensionality reduction on high dimensional similarity search , 2001, PODS.

[11]  Hui Han,et al.  Name disambiguation in author citations using a K-way spectral clustering method , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[12]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications) , 2006 .

[13]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[14]  Peter Christen,et al.  Febrl - Freely extensible biomedical record linkage , 2002 .

[15]  Weiyi Meng,et al.  A Latent Topic Model for Complete Entity Resolution , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[16]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[17]  W. Winkler Overview of Record Linkage and Current Research Directions , 2006 .

[18]  Esko Ukkonen,et al.  Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[19]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[20]  H B NEWCOMBE,et al.  Automatic linkage of vital records. , 1959, Science.

[21]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[22]  Renée J. Miller,et al.  Framework for Evaluating Clustering Algorithms in Duplicate Detection , 2009, Proc. VLDB Endow..

[23]  Piotr Indyk,et al.  Nearest Neighbors in High-Dimensional Spaces , 2004, Handbook of Discrete and Computational Geometry, 2nd Ed..

[24]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Don X. Sun,et al.  Methods for Linking and Mining Massive Heterogeneous Databases , 1998, KDD.

[26]  James Bennett,et al.  The Netflix Prize , 2007 .

[27]  Ahmed K. Elmagarmid,et al.  Automating the approximate record-matching process , 2000, Inf. Sci..

[28]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[29]  Jitendra Malik,et al.  Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Robert H. Halstead,et al.  Matrix Computations , 2011, Encyclopedia of Parallel Computing.

[31]  Peter Christen,et al.  A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[32]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[33]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques , 2006, Data-Centric Systems and Applications.

[34]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[35]  Luis Gravano,et al.  Text joins in an RDBMS for web data integration , 2003, WWW '03.

[36]  William W. Cohen,et al.  Power Iteration Clustering , 2010, ICML.

[37]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[38]  P. Bickel,et al.  A nonparametric view of network models and Newman–Girvan and other modularities , 2009, Proceedings of the National Academy of Sciences.

[39]  Ling Huang,et al.  Fast approximate spectral clustering , 2009, KDD.

[40]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[41]  Chak-Kuen Wong,et al.  Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees , 1977, Acta Informatica.

[42]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[43]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[44]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[45]  Lise Getoor,et al.  Deduplication and Group Detection using Links , 2004 .

[46]  Mikhail Belkin,et al.  Consistency of spectral clustering , 2008, 0804.0678.

[47]  Stuart E. Madnick,et al.  The inter-database instance identification problem in integrating autonomous systems , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[48]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[49]  Jianzhong Li,et al.  Reasoning about Record Matching Rules , 2009, Proc. VLDB Endow..