Paper information - Data Integration for Many Data Sources using Context-Sensitive Similarity Metrics

Good similarity functions are crucial for many important subtasks in data integration, such as “soft joins” and data deduplication, and one widely used similarity function is TFIDF similarity. In this paper we describe a modification of TFIDF similarity that is more appropriate for certain datasets: namely, large data collections formed by merging together many smaller collections, each of which is (nearly) duplicate-free. Our similarity metric, called CX.IDF, shares TFIDF’s most important properties: it can be computed efficiently and stored compactly; it can be “learned” using few passes over a dataset (in experiments, one or three passes are used) and is well suited to parallelization; and finally, like TFIDF, it requires no labeled training data. In experiments, the new similarity function reduces matching errors relative to TFIDF by up to 80%, and reduces k-nearest neighbor classification error by 20% on average.
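
For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of one common TFIDF cosine-similarity variant, not the CX.IDF metric itself, whose definition is not given in the abstract. The whitespace tokenizer, the log(1 + tf) and log(N / df) weightings, the function names, and the toy records are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems vary.
    return text.lower().split()


def idf_weights(records):
    # One pass over the collection: count how many records contain each token.
    n = len(records)
    df = Counter()
    for record in records:
        df.update(set(tokenize(record)))
    # Rare tokens get large weights; tokens in every record get weight 0.
    return {token: math.log(n / count) for token, count in df.items()}


def tfidf_vector(text, idf):
    # Weight each token by log(1 + tf) * idf, then scale to unit length.
    tf = Counter(tokenize(text))
    vec = {t: math.log(1 + c) * idf.get(t, 0.0) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def tfidf_similarity(a, b, idf):
    # Cosine similarity: dot product of the two unit-length vectors.
    va, vb = tfidf_vector(a, idf), tfidf_vector(b, idf)
    if len(va) > len(vb):
        va, vb = vb, va
    return sum(w * vb.get(t, 0.0) for t, w in va.items())


# Hypothetical toy collection, used only to exercise the functions.
records = ["acme corp", "acme corporation", "widget corp", "widget inc"]
idf = idf_weights(records)
score = tfidf_similarity("acme corp", "acme corporation", idf)
```

Like the properties listed in the abstract, this baseline needs only unlabeled records and a single counting pass to build the weight table, which is stored as one number per token.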
