Information extraction and integration for web databases
A large number of the Web pages returned by filling in search forms cannot be indexed by most search engines today, since they are generated dynamically by querying a back-end (relational or object-relational) database. Such Web sites, referred to as Web databases, usually present complex data objects with nested structures in their Web pages. In this thesis, we address a variety of problems related to retrieving information from Web databases. To extract structured data embedded in template-generated pages from Web databases, we first develop an algorithm that automatically identifies the data-rich sections of a page, and then propose a new approach that automatically induces regular-expression wrappers from those sections. To understand the semantics of both the query interfaces and the extracted data from various Web databases, and to integrate them, we propose a combined schema model that describes the different schemas of a Web database (global, interface and result schema). We then address two significant schema-matching problems for Web databases, intra-site schema matching and inter-site schema matching, and investigate an instance-based method that uses domain-specific query probing to solve both problems simultaneously.
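As a rough illustration of instance-based matching through query probing, the sketch below submits domain values through each interface attribute and counts which result column those values reappear in. It is a minimal toy under stated assumptions, not the method developed in the thesis: the names `probe_match`, `toy_query`, `BOOKS`, and the attribute and column labels are all hypothetical, and an in-memory list stands in for a real Web database.

```python
from collections import defaultdict

# Hypothetical back-end data; a real Web database would sit behind a search form.
BOOKS = [
    {"title": "Database Systems", "author": "Ullman"},
    {"title": "Web Data Mining", "author": "Liu"},
]

def toy_query(interface_attr, value):
    """Simulate filling one form field and return result rows with opaque column names."""
    # Assumed hidden mapping from interface attributes to back-end fields.
    field = {"Title keyword": "title", "Author name": "author"}[interface_attr]
    return [
        {"col_a": book["title"], "col_b": book["author"]}
        for book in BOOKS
        if value.lower() in book[field].lower()
    ]

def probe_match(interface_attrs, probe_values, query_fn):
    """Map each interface attribute to the result column where probes reappear most often."""
    counts = {attr: defaultdict(int) for attr in interface_attrs}
    for attr in interface_attrs:
        for value in probe_values:
            for row in query_fn(attr, value):
                for column, cell in row.items():
                    # A probe value reappearing in a column is evidence of a match.
                    if value.lower() in cell.lower():
                        counts[attr][column] += 1
    return {
        attr: (max(cols, key=cols.get) if cols else None)
        for attr, cols in counts.items()
    }

# Domain-specific probe values (assumed to come from sample instances of the domain).
PROBES = ["Ullman", "Web", "Liu", "Database"]
matches = probe_match(["Title keyword", "Author name"], PROBES, toy_query)
print(matches)
```

In this toy setting the occurrence counts alone separate the two attributes. A real site would need to handle noisy pages, shared vocabulary across columns, and probes that return no results.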