Automatic wrapper generation for the extraction of search result records from search engines

The deep web, which is estimated to be about 500 times larger than the surface web, is extremely under-utilized. Researchers have been working on various issues in building large-scale deep web applications, which aim to unleash the real power of the deep web. One of the key issues facing large-scale deep web applications is the extraction and understanding of the data returned by deep web sites. To utilize the data in deep web sites, we need to extract the data (search result records) from the search result pages returned by those sites; these pages contain both the data of interest and other unrelated content. Data extraction from web pages is generally a very hard problem, and the performance of existing approaches in the literature is far from satisfactory. This dissertation studies the problem of extracting search result records from pages returned by search engines on both deep web sites and surface web sites. A method that combines the visual content features and the HTML tag structures of the result pages is proposed to generate wrappers for the extraction of search result records. This novel technique achieves significantly better performance than state-of-the-art approaches. Extracting search result records from categorized result pages requires maintaining the section-record relationships. Major issues such as section boundaries and optional sections make good performance difficult to achieve. We introduce a novel method based on the content properties of search result records and the dynamic properties of sections. A search result record usually consists of multiple data units. The semi-structured nature of search result records makes data unit extraction a hard problem. Mismatches between the HTML tag structures and the data structure of search result records, as well as optional and disjunctive data units, further limit performance.
We introduce a novel directed acyclic graph representation of search result record templates, which can be used to extract data units from search result records. An effective algorithm, based on machine learning and statistics, that extracts templates from search result records is also presented.
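
To make the idea of a directed acyclic graph template more concrete, the following is a minimal illustrative sketch and not the dissertation's actual representation or algorithm. It assumes a hand-written template in which each node is a data unit type with a simple matcher, an optional unit is modeled by edges that bypass it, and disjunctive units are modeled by branching edges. All names here (`TEMPLATE`, `extract`, the node labels, and the regular-expression matchers) are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical matchers for a few data unit types (assumptions for illustration).
def is_url(s: str) -> bool:
    return re.fullmatch(r"https?://\S+", s) is not None

def is_price(s: str) -> bool:
    return re.fullmatch(r"\$\d+(\.\d{2})?", s) is not None

def is_date(s: str) -> bool:
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) is not None

def is_text(s: str) -> bool:
    return bool(s) and not (is_url(s) or is_price(s) or is_date(s))

Node = Tuple[Optional[Callable[[str], bool]], List[str]]

# Template DAG: node label -> (matcher, successor labels).
# START and END are sentinel nodes that consume no input.
TEMPLATE: Dict[str, Node] = {
    "START":   (None,     ["title"]),
    "title":   (is_text,  ["snippet", "price", "date"]),  # snippet is optional
    "snippet": (is_text,  ["price", "date"]),
    "price":   (is_price, ["url"]),                       # price OR date
    "date":    (is_date,  ["url"]),
    "url":     (is_url,   ["END"]),
    "END":     (None,     []),
}

def extract(template: Dict[str, Node], units: List[str],
            node: str = "START", i: int = 0) -> Optional[List[Tuple[str, str]]]:
    """Depth-first search for a START-to-END path that consumes every unit.

    Returns (label, unit) pairs for the first such path, or None if the
    record does not fit the template.
    """
    matcher, successors = template[node]
    labeled: List[Tuple[str, str]] = []
    if matcher is not None:
        if i >= len(units) or not matcher(units[i]):
            return None
        labeled = [(node, units[i])]
        i += 1
    if node == "END":
        return [] if i == len(units) else None
    for nxt in successors:
        rest = extract(template, units, nxt, i)
        if rest is not None:
            return labeled + rest
    return None
```

For example, calling `extract(TEMPLATE, ["Cheap flights", "$99.00", "http://example.com/a"])` labels the three units as title, price, and url, skipping the optional snippet, while a record that carries a date instead of a price follows the other branch. In this sketch the template is written by hand; the dissertation instead describes learning templates from the records themselves.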