CCReSD: concept-based categorisation of Hidden Web databases

Hidden Web databases dynamically generate results in response to users' queries. Categorising such databases under a category scheme is widely used to support information searches. We present a Concept-based Categorisation over Refined Sampled Documents (CCReSD) approach that handles information extraction, summarisation and categorisation of such databases. CCReSD detects and extracts query-related information from documents sampled from the databases. It generates terms and their frequencies to summarise database contents. It also generates descriptions of concepts from their coverage and specificity as given in a category scheme. We conduct experiments to evaluate our approach and show that it assigns databases to more relevant subject categories.
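The summarisation step, in which terms and frequencies are generated from sampled documents, can be illustrated with a minimal sketch. This is not the CCReSD implementation: the function name `summarise`, the regular-expression tokeniser, the small stopword list and the example documents are all assumptions made for illustration, and the extraction of query-related content that CCReSD performs beforehand is omitted.

```python
import re
from collections import Counter

# Hypothetical stopword list; the actual approach may filter terms differently.
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "to"})


def summarise(sampled_docs, stopwords=STOPWORDS):
    """Return a term -> frequency summary over documents sampled from a database."""
    counts = Counter()
    for doc in sampled_docs:
        # Lowercase the text and keep alphabetic tokens only (an assumed tokenisation).
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Made-up documents standing in for a sample drawn from one database.
docs = [
    "Heart disease and the treatment of heart failure",
    "Treatment of diabetes in adults",
]
summary = summarise(docs)
print(summary.most_common(3))
```

A summary of this kind could then be compared against concept descriptions in a category scheme; how that comparison uses coverage and specificity is not detailed in the abstract, so it is left out here.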
