Sources Selection Methodology for Hidden Web Data Integration

Abstract—In internet-scale hidden web data integration, the selection of sources (web databases) is a primary challenge. This paper proposes a novel approach to web database selection for internet-scale hidden web data integration. The approach is based on a benefit function that evaluates how much benefit a web database brings to a given state of the integration system when it is integrated. With the estimated benefit information, web database selection can be carried out iteratively. Preliminary results show that our technique provides an effective mechanism to select and integrate web databases.

Index Terms—hidden web, data integration, web database selection.

I. INTRODUCTION

More and more databases are becoming accessible on the web through search interfaces. This information is often called the "hidden web" or "deep web". The hidden web is believed to be possibly larger than the "surface web", and typically has very high-quality content [1]. According to the survey [2] released by UIUC in 2004, there were more than 300,000 hidden web sites and 450,000 query interfaces available at that time, and both figures are still increasing rapidly.

There may be hundreds or thousands of web databases providing data relevant to a particular domain. The user may not want to include all available web databases in the integration system being defined, and may not want to send every query to all web databases in the system, especially if there is significant overlap in the data of the different web databases and many of them are of low quality. Moreover, there are networking and processing costs associated with including a web database in the integration system: the costs of retrieving data from the database while executing queries, mapping this data to the global mediated schema, and so on. The more sources we have, the higher these costs.
An integration system therefore cannot include all available web databases, and web database selection has become a primary challenge for internet-scale hidden web data integration. In this setting, the problem of web database selection emerges in two phases. In the first phase, before the integration system is built, m web databases must be automatically selected for integration from the hundreds or thousands of web databases relevant to a particular domain, where m is the maximum number of web databases that the user is willing to select. In the second phase, after the integration system is built, the set of web databases most relevant to a given query must be selected to perform the search. In traditional small-scale data integration tasks, domain experts determine which web databases should be included in the integration system, so there is little work on the first phase [3][4].

In this paper, we study the problem of automating the selection of web databases to integrate in the first phase. Our goal is to select and integrate m web databases that together contain as much high-quality data as possible, with the least degree of overlap between the data in the integration system.

We begin by presenting an approach for iteratively selecting and integrating hidden web databases. At each step, the approach selects the most beneficial web database from a set of candidate web databases and integrates it. After each web database is integrated, we update the state of the integration system and recompute which web database is the next most beneficial to integrate. The core of this approach is a benefit function that evaluates how much benefit a web database brings to a given state of the integration system when it is integrated. We therefore devise a benefit function for web databases based on the volume and quality of the new data that integrating the web database adds to the integration system.
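As a rough illustration of the iterative procedure described above, the following Python sketch greedily picks up to m databases and re-evaluates the benefit of the remaining candidates after every integration. It is not the benefit function of this paper: `benefit` is a stand-in that assumes each candidate can be summarized by a set of record identifiers and a single quality score in [0, 1], and all names (`benefit`, `select_databases`, the toy candidates) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a candidate web database is summarized by the set of record
# identifiers it holds and one quality score in [0, 1].
Candidate = Tuple[Set[str], float]


def benefit(records: Set[str], quality: float, integrated: Set[str]) -> float:
    """Assumed benefit: quality-weighted number of records not yet integrated."""
    return quality * len(records - integrated)


def select_databases(candidates: Dict[str, Candidate], m: int) -> List[str]:
    """Greedily select up to m databases, recomputing benefit after each pick."""
    integrated: Set[str] = set()   # records already in the integration system
    remaining = dict(candidates)   # candidates not yet integrated
    chosen: List[str] = []
    while remaining and len(chosen) < m:
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda name: benefit(*remaining[name], integrated))
        if benefit(*remaining[best], integrated) <= 0:
            break                  # no candidate contributes new data
        records, _ = remaining.pop(best)
        integrated |= records      # update the state of the integration system
        chosen.append(best)
    return chosen


# Toy data (hypothetical): A and B overlap heavily, C is small but disjoint.
toy = {
    "A": ({"r1", "r2", "r3", "r4"}, 0.9),
    "B": ({"r1", "r2", "r3", "r5"}, 1.0),
    "C": ({"r6", "r7"}, 0.8),
}
print(select_databases(toy, m=2))
```

On this toy data the sketch selects B and then C: although A has the second-highest standalone benefit, most of its records are already covered once B is integrated, so the recomputed benefit favors the smaller but non-overlapping C.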
Preliminary results show that our technique provides an effective mechanism to select and integrate web databases. The remainder of the paper is organized as follows. Section 2 discusses our benefit function for evaluating the benefit of web databases. Section 3 describes the algorithms for web database selection. Section 4 presents a detailed evaluation of our web database selection strategy. We conclude in Section 5.

II. B