Data Integration for Many Data Sources using Context-Sensitive Similarity Metrics

Good similarity functions are crucial for many important subtasks in data integration, such as “soft joins” and data deduping, and one widely-used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is more appropriate for certain datasets: namely, large data collections formed by merging together many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF’s most important properties: it can be computed efficiently and stored compactly; it can be“learned”using few passes over a dataset (in experiments, one or three passes are used), and is wellsuited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest neighbor classification error by 20% on average.

[1]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[2]  Gerald Salton,et al.  Automatic text processing , 1988 .

[3]  Raghu Ramakrishnan,et al.  Source-aware Entity Matching: A Compositional Approach , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[4]  William W. Cohen Data integration using similarity joins and a word-based information representation language , 2000, TOIS.

[5]  Hui Han,et al.  Name disambiguation in author citations using a K-way spectral clustering method , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[6]  Ian Davidson,et al.  Constrained Clustering: Advances in Algorithms, Theory, and Applications , 2008 .

[7]  Albert H. Smith,et al.  Probabilistic Representations for Integrating Unreliable Data Sources , 2007 .

[8]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near duplicate detection , 2008, WWW.

[9]  François Yvon,et al.  Robust Similarity Measures for Named Entities Matching , 2008, COLING.

[10]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[11]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[12]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[13]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[14]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[15]  Claire Cardie,et al.  Clustering with Instance-Level Constraints , 2000, AAAI/IAAI.

[16]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[17]  Carlos Alberto Heuser,et al.  Measuring quality of similarity functions in approximate data matching , 2007, J. Informetrics.

[18]  Sugato Basu,et al.  Adaptive product normalization: using online learning for record linkage in comparison shopping , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[19]  Pradeep Ravikumar,et al.  Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[20]  Howard R. Turtle,et al.  Query Evaluation: Strategies and Optimizations , 1995, Inf. Process. Manag..

[21]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.