Paper Information - Record Matching Over Query Result from Multiple Web Databases

Record Matching Over Query Result from Multiple Web Databases

Record matching, the process of identifying records that represent the same real-world entity, is an important step in data integration. Most record matching methods are supervised and require the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results generated dynamically, on the fly. Such records are query-dependent, so a pre-learned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. The method uses two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM) classifier, to iteratively identify duplicates in the query results from multiple Web databases. Using these two classifiers, duplicate and non-duplicate vectors are calculated, and the non-duplicate vectors are displayed as the result.
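As a rough illustration of the weighted component similarity summing idea named in the abstract, the sketch below scores a record pair by a weighted sum of per-field similarities and compares the score to a threshold. It is a minimal sketch under stated assumptions, not the paper's implementation: the field names, the weights, the threshold value, and the use of `difflib.SequenceMatcher` as the string similarity measure are all hypothetical choices made for this example, and the SVM classifier and the iterative cooperation between the two classifiers are omitted.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib is a stand-in, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1: dict, r2: dict, fields: list) -> list:
    """One similarity component per shared field of the two records."""
    return [field_similarity(r1[f], r2[f]) for f in fields]


def wcss_score(vector: list, weights: list) -> float:
    """Weighted sum of the component similarities (weights assumed to sum to 1)."""
    return sum(w * s for w, s in zip(weights, vector))


def is_duplicate(r1: dict, r2: dict, fields: list, weights: list,
                 threshold: float = 0.85) -> bool:
    """Flag the pair as a duplicate when the weighted score reaches the threshold."""
    return wcss_score(similarity_vector(r1, r2, fields), weights) >= threshold


# Hypothetical fields, weights, and records, chosen only for illustration.
FIELDS = ["title", "author"]
WEIGHTS = [0.6, 0.4]

rec_a = {"title": "Data Mining: Concepts and Techniques", "author": "Jiawei Han"}
rec_b = {"title": "Data Mining: Concepts and Techniques", "author": "Jiawei Han"}
rec_c = {"title": "Introduction to Algorithms", "author": "Thomas Cormen"}

same_pair = is_duplicate(rec_a, rec_b, FIELDS, WEIGHTS)
different_pair = is_duplicate(rec_a, rec_c, FIELDS, WEIGHTS)
```

In a fuller pipeline along the lines the abstract describes, pairs flagged here could serve as training examples for a second classifier; that step is left out because the abstract does not specify its details.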

Amol D. Potgantwar | Gokul K. Bodke | Meenal S. Khairnar | A. Potgantwar

[1] Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[2] Peter Christen, et al. Automatic record linkage using seeded nearest neighbour and support vector machine classification, 2008, KDD.

[3] Andrew McCallum, et al. Efficient clustering of high-dimensional data sets with application to reference matching, 2000, KDD '00.

[4] Rajeev Motwani, et al. Robust and efficient fuzzy match for online data cleaning, 2003, SIGMOD '03.

[5] Surajit Chaudhuri, et al. Eliminating Fuzzy Duplicates in Data Warehouses, 2002, VLDB.

[6] Divesh Srivastava, et al. Record linkage: similarity measures and algorithms, 2006, SIGMOD Conference.

[7] Weifeng Su, et al. Record Matching over Query Results from Multiple Web Databases, 2010, IEEE Transactions on Knowledge and Data Engineering.

[8] Jiawei Han, et al. PEBL: Web page classification without negative examples, 2004, IEEE Transactions on Knowledge and Data Engineering.

[9] Ahmed K. Elmagarmid, et al. Duplicate Record Detection: A Survey, 2007, IEEE Transactions on Knowledge and Data Engineering.

[10] Matthew A. Jaro, et al. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, 1989.