Approximate Content Summary for Database Selection in Deep Web Data Integration

In Deep Web data integration, a metaquerier provides a unified query interface for each domain and dispatches each user query to the most relevant Web databases. Traditional database selection algorithms are often based on content summaries. However, many Web-accessible databases are uncooperative, and the only way to access their contents is by querying them. In this paper, we propose an approximate content summary approach for database selection. Furthermore, real-life databases are not always static, so the statistical content summary needs to be updated periodically to reflect changes in database content. We therefore also propose a survival function approach that determines an appropriate schedule for regenerating the approximate content summary. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
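As a rough illustration of the two ideas above, the following minimal sketch builds an approximate content summary by issuing queries and sampling the returned records, then picks a refresh time from a survival function. It is not the method of this paper: the names `build_summary`, `score`, `survival`, and `next_refresh`, the `query_fn` callback, the product-of-fractions scoring, and the exponential survival model with a fixed change rate are all assumptions made for illustration only.

```python
import math
from collections import Counter


def build_summary(query_fn, seed_terms, max_queries=50, docs_per_query=4):
    """Estimate per-term document fractions by query-based sampling.

    query_fn(term) is a hypothetical callback that returns a list of
    (doc_id, text) pairs from an uncooperative database.
    """
    seen_ids = set()
    doc_freq = Counter()
    queue = list(seed_terms)
    tried = set()
    queries = 0
    while queue and queries < max_queries:
        term = queue.pop(0)
        if term in tried:
            continue
        tried.add(term)
        queries += 1
        for doc_id, text in query_fn(term)[:docs_per_query]:
            if doc_id in seen_ids:
                continue
            seen_ids.add(doc_id)
            words = set(text.lower().split())
            doc_freq.update(words)
            # Words from sampled records become candidate probe queries.
            queue.extend(sorted(words - tried))
    n = len(seen_ids)
    return {w: c / n for w, c in doc_freq.items()} if n else {}


def score(summary, query):
    """Assumed relevance score: product of the estimated term fractions."""
    result = 1.0
    for word in query.lower().split():
        result *= summary.get(word, 0.0)
    return result


def survival(t, rate):
    """Assumed exponential model: probability the summary is still valid at time t."""
    return math.exp(-rate * t)


def next_refresh(rate, threshold=0.5):
    """Time at which the assumed survival probability drops to the threshold."""
    return -math.log(threshold) / rate
```

Under these assumptions, a metaquerier would compute `score` against the summary of each database and dispatch the query to the highest-scoring ones, and would regenerate a summary once `next_refresh` time units have elapsed since it was built. The change rate would have to be estimated from observed database updates, which the sketch does not attempt.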