Domain-oriented Deep Web Data Sources' Discovery and Identification

Because the Deep Web contains a tremendous number of well-structured data sources, integrating them has become a hot topic in current research. Accurately discovering and identifying the Deep Web data sources related to a specific domain is a key issue. We propose a Domain-Oriented Deep Web data source Discovery method (DO-DWD) and a novel Domain Identification strategy for Deep Web data sources (DIDW). In the discovery stage, we use machine learning algorithms and heuristic rules to find the query interfaces of data sources. In the identification stage, we identify the Deep Web data sources associated with the domain by calculating the relevance between a query interface and the domain based on semantic similarity. Finally, extensive experiments on a real data set show that DO-DWD and DIDW achieve high correctness and accuracy.
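The two stages can be illustrated with a minimal sketch. It is not the DO-DWD or DIDW implementation: the heuristic rule (a form with a text input and no password field counts as a query interface), the use of bag-of-words cosine similarity as a stand-in for the semantic similarity measure, and all names (`FormCollector`, `find_query_interfaces`, `relevance`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect the input types and visible text of every <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"input_types": [], "text": []}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text box.
            self._current["input_types"].append((attrs.get("type") or "text").lower())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def find_query_interfaces(html):
    """Discovery stage (assumed heuristic): keep forms that look searchable."""
    parser = FormCollector()
    parser.feed(html)
    return [
        form for form in parser.forms
        if any(t in ("text", "search") for t in form["input_types"])
        and "password" not in form["input_types"]  # drop login forms
    ]


def relevance(form, domain_terms):
    """Identification stage (assumed measure): cosine similarity between
    the form's visible words and a domain vocabulary."""
    form_vec = Counter(re.findall(r"[a-z]+", " ".join(form["text"]).lower()))
    domain_vec = Counter(term.lower() for term in domain_terms)
    if not form_vec or not domain_vec:
        return 0.0
    dot = sum(form_vec[word] * domain_vec[word] for word in form_vec)
    norm = math.sqrt(sum(v * v for v in form_vec.values())) * \
        math.sqrt(sum(v * v for v in domain_vec.values()))
    return dot / norm
```

A page would be passed to `find_query_interfaces`, and each surviving form scored with `relevance` against a hand-built vocabulary for the target domain; forms scoring above a chosen threshold would be treated as belonging to that domain. A genuinely semantic measure would also need to match synonyms, which plain word overlap does not.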
