Semantic deep web: automatic attribute extraction from the deep web data sources

The "Deep Web" refers to the rich information and data hidden in backend databases and similar sources that search engines and Web crawlers cannot reach. It is mostly accessible only through manual query interfaces. This paper introduces the Semantic Deep Web, which uses an ontology to determine the relevance of query interface attributes for accessing the Deep Web. In addition, we present a novel approach to automatically extracting attributes from query interfaces, in order to address current limitations in accessing Deep Web data sources. Our Automatic Attribute Extraction method identifies (1) the attributes used by the designers of query Web pages, called Programmer Viewpoint Attributes, and (2) the attributes presented to users as labels, called User Viewpoint Attributes. An ontology enriches the candidate query attributes by providing synonyms and by supporting both the attributes used by designers and those presented to users. Our experimental results in several e-commerce domains show that the attributes obtained by our algorithm compare favorably with manually determined attributes for use in Deep Web queries.
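
To make the two attribute viewpoints concrete, the following is a minimal sketch and not the algorithm evaluated in this paper. It assumes that Programmer Viewpoint Attributes can be approximated by the name values of form fields and User Viewpoint Attributes by the visible label text. The class QueryInterfaceParser, the functions extract_attributes and enrich, the sample form, and the tiny synonym table TOY_ONTOLOGY (a stand-in for a real ontology) are all hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class QueryInterfaceParser(HTMLParser):
    """Collects field names and label texts from an HTML query form."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.programmer_attrs = []  # name= values chosen by the page designer
        self.user_attrs = []        # visible <label> text shown to the user
        self._in_label = False
        self._label_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.FIELD_TAGS and attrs.get("name"):
            # Assumption: hidden and submit controls are not query attributes.
            if attrs.get("type") not in ("hidden", "submit"):
                self.programmer_attrs.append(attrs["name"])
        elif tag == "label":
            self._in_label = True
            self._label_text = []

    def handle_endtag(self, tag):
        if tag == "label" and self._in_label:
            # Collapse whitespace and drop a trailing colon such as "Title:".
            text = " ".join("".join(self._label_text).split())
            text = text.rstrip(":").strip()
            if text:
                self.user_attrs.append(text)
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            self._label_text.append(data)


def extract_attributes(html):
    """Return (programmer viewpoint, user viewpoint) attribute lists."""
    parser = QueryInterfaceParser()
    parser.feed(html)
    parser.close()
    return parser.programmer_attrs, parser.user_attrs


# Hypothetical synonym table standing in for a domain ontology.
TOY_ONTOLOGY = {
    "author": {"writer", "creator"},
    "title": {"name"},
    "price": {"cost"},
}


def normalize(attribute):
    """Lowercase an attribute and turn separators into spaces."""
    return " ".join(attribute.lower().replace("_", " ").replace("-", " ").split())


def enrich(attributes, ontology):
    """Map each normalized attribute to its sorted synonyms, if any."""
    return {normalize(a): sorted(ontology.get(normalize(a), set()))
            for a in attributes}


SAMPLE_FORM = """
<form action="/search">
  <label for="t">Title:</label> <input type="text" name="book_title" id="t">
  <label for="a">Author</label> <input type="text" name="author" id="a">
  <input type="hidden" name="session" value="x">
  <input type="submit" name="go" value="Search">
</form>
"""
```

Calling extract_attributes(SAMPLE_FORM) yields the field names and the label texts as two separate lists, and enrich can then be applied to their concatenation. In this sketch an attribute such as book_title finds no synonyms because the lookup is an exact match on the normalized string; a real ontology would be expected to relate it to title.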