Towards automatic understanding and integration of web databases for developing large-scale unified access systems

The rapid growth of the World Wide Web has made numerous Web sources available for online access. A significant number of them are driven by relational databases such as Oracle and MySQL and are publicly accessible through Web-based search interfaces (i.e., HTML search forms). When users submit queries through such a search interface, the data values retrieved from the underlying database are encoded and wrapped into dynamically generated HTML pages. We refer to such data sources as Web databases. Examples of Web databases are amazon.com, bestbuy.com and monster.com. As the Web continues to grow rapidly, searching a single Web database is increasingly unlikely to yield all the desired information. Instead, information has to be acquired and assembled from multiple related Web databases. To enable efficient and effective access, these Web databases need to be integrated and a mediator access interface provided to ordinary Web users. Users submit queries against this interface, and the search mediator is responsible for translating each query into sub-queries specific to each underlying Web database, sending them out on behalf of the users, and returning the combined search results. Due to the semi-structured nature of HTML data, building such integrated Web search systems requires considerable manual effort, time and expertise, especially when the number of sources is large. Thus, it is critical to develop intelligent techniques that facilitate building an integrated search system over Web databases and thereby minimize the cost involved. Building such a system raises many research issues. 
In this dissertation, we focus on three major issues: Web search interface understanding, interface integration, and returned results understanding. Our approach to building the mediator access interface is to integrate multiple Web search interfaces and build a unified search interface on top of them. To better understand and utilize Web databases, it is critical to first understand their Web search interfaces. We have proposed a schema model for representing form-based search interfaces of Web databases. This model supports precise understanding of search interfaces by capturing their logical attributes as well as a significant amount of useful semantic and meta information about this type of search interface. (Abstract shortened by UMI.)
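
The mediator behavior described above (translate a unified query into source-specific sub-queries, send them out, combine the results) can be illustrated with a minimal sketch. All names here (`WebDatabase`, `field_map`, `translate`, `mediate`) are hypothetical and are not taken from the dissertation; the `search` callable is a stand-in for submitting a source's HTML form and extracting the returned records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only.
Query = Dict[str, str]   # attribute name -> value
Record = Dict[str, str]  # one extracted result record


@dataclass
class WebDatabase:
    name: str
    # Maps unified attribute names to this source's own form field names.
    field_map: Dict[str, str]
    # Stand-in for submitting the search form and parsing the result page.
    search: Callable[[Query], List[Record]]


def translate(query: Query, source: WebDatabase) -> Query:
    """Rewrite a unified query as a source-specific sub-query,
    dropping attributes the source does not support."""
    return {source.field_map[k]: v for k, v in query.items() if k in source.field_map}


def mediate(query: Query, sources: List[WebDatabase]) -> List[Record]:
    """Send the translated sub-query to every source and combine the results,
    tagging each record with the source it came from."""
    combined: List[Record] = []
    for source in sources:
        for record in source.search(translate(query, source)):
            combined.append({**record, "source": source.name})
    return combined
```

In this sketch, the per-source `field_map` is assumed to be given; producing such mappings automatically is the kind of output that interface understanding and interface integration would need to supply.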