Paper information - Domain-based data integration for web databases

Web databases are an important part of today’s Web, and 80% of them are structured. To help users retrieve relevant records from several Web databases at once, we propose a simultaneous querying system, SIM-querying, which comprises three components: a query interface integrator, a data extractor, and a result integrator. Each component introduces a novel method that performs its function automatically. In the query interface integrator, a holistic schema matching method, HSM, uses the attribute occurrence patterns across multiple query interfaces to find the attributes that match in different interfaces within a domain. In the data extractor, a domain-based data extraction method, ODE, first learns a domain ontology from the information overlap and schema matching in the query results and query interfaces of different Web databases within the domain, and then uses the ontology to automatically extract the data encoded in the result HTML pages. In the result integrator, a new duplicate detection method, UDD, identifies the duplicates that exist in the query results from different Web databases. UDD first constructs a set of negative records based on two observations about the query results of Web databases; then, starting from the negative records, an iterative algorithm identifies the duplicates. Experimental results show that each of these novel methods can achieve very high precision and outperform existing methods in the context of Web databases.
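To make "attribute occurrence patterns" concrete, here is a minimal sketch of one such pattern. It is not the HSM algorithm itself, which the abstract does not specify. It assumes a simple heuristic: two attributes that each appear in many query interfaces of a domain, but never in the same interface, are plausible synonyms. The function name `candidate_matches`, the `min_support` threshold, and the sample interfaces are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_matches(interfaces, min_support=2):
    """Return attribute pairs that each occur in at least `min_support`
    interfaces but never appear together in the same interface.

    This is an illustrative heuristic, not the HSM method from the paper.
    """
    occurrences = Counter()     # attribute -> number of interfaces containing it
    co_occurrences = Counter()  # (attr_a, attr_b) -> number of interfaces containing both
    for attrs in interfaces:
        unique = sorted(set(attrs))
        occurrences.update(unique)
        co_occurrences.update(combinations(unique, 2))

    # Only consider attributes seen often enough to be meaningful.
    frequent = sorted(a for a, n in occurrences.items() if n >= min_support)

    # Pairs come out in sorted order, matching the keys stored above.
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if co_occurrences[(a, b)] == 0
    ]


# Hypothetical query interfaces from a book-search domain.
interfaces = [
    ["title", "author", "isbn"],
    ["title", "writer", "isbn"],
    ["title", "author"],
    ["title", "writer"],
]
print(candidate_matches(interfaces))  # [('author', 'writer')]
```

In this toy input, "author" and "writer" each appear twice but never together, so they are the only candidate pair. A real matcher would also need to score candidates and handle attributes that are mutually exclusive for other reasons.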

Weifeng Su | Frederick H. Lochovsky