Record Matching Over Query Results from Multiple Web Databases

Record matching, the process of identifying records that represent the same real-world entity, is an important step in data integration. Most record matching methods are supervised and require the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results generated dynamically, on the fly. Such records are query-dependent, and a pre-learned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates among the query result records of multiple Web databases. The method uses two cooperating classifiers, a Weighted Component Similarity Summing (WCSS) classifier and a Support Vector Machine (SVM) classifier, to iteratively identify duplicates in the query results from multiple Web databases. Using these two classifiers, duplicate and non-duplicate vectors are identified, and the non-duplicate vectors are displayed as the result.
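
As a rough illustration of the weighted-summing idea behind the WCSS classifier, the sketch below scores a pair of records by a weighted sum of per-field similarities and flags the pair as a duplicate when the score reaches a threshold. This is a minimal sketch under stated assumptions, not the UDD implementation: the field names, the fixed weights, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the field similarity are all hypothetical choices made for the example, and the SVM classifier and the iteration between the two classifiers are not shown.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values.

    Assumption: a generic string ratio stands in for whatever
    field-level similarity measure the actual method uses.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1: dict, r2: dict, fields: list) -> list:
    """One similarity component per shared field of a record pair."""
    return [field_similarity(r1[f], r2[f]) for f in fields]


def wcss_score(vector: list, weights: list) -> float:
    """Weighted sum of component similarities (weights assumed to sum to 1)."""
    return sum(w * s for w, s in zip(weights, vector))


def is_duplicate(vector: list, weights: list, threshold: float = 0.85) -> bool:
    """Flag a pair as duplicate when its weighted score reaches the threshold.

    Assumption: the threshold value is illustrative only.
    """
    return wcss_score(vector, weights) >= threshold


if __name__ == "__main__":
    fields = ["title", "author"]   # hypothetical record schema
    weights = [0.6, 0.4]           # hypothetical fixed field weights

    a = {"title": "Data Mining", "author": "Han"}
    b = {"title": "Data Mining", "author": "Han"}
    c = {"title": "zzzz", "author": "qqqq"}

    print(is_duplicate(similarity_vector(a, b, fields), weights))
    print(is_duplicate(similarity_vector(a, c, fields), weights))
```

In this sketch the weights are supplied by hand; an unsupervised method would need to derive them from the data rather than from labeled examples.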