A two-phase sampling technique for information extraction from hidden web databases

Hidden Web databases maintain collections of specialised documents, which are dynamically generated in response to users' queries. However, these documents are generated from Web page templates, which contain information that is irrelevant to the queries. This paper presents a Two-Phase Sampling (2PS) technique that detects templates and extracts query-related information from the documents sampled from a database. In the first phase, 2PS queries a database with terms drawn from its search interface pages and from the documents sampled so far, until a required number of documents has been retrieved. In the second phase, 2PS detects Web page templates in the sampled documents and removes them, so that only information relevant to the queries is extracted. We test 2PS on a number of real-world Hidden Web databases. Experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
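
The two phases described above can be illustrated with a minimal Python sketch. It is not the 2PS implementation: the abstract does not specify how templates are detected, so the sketch assumes a simple stand-in heuristic (a line that recurs in most sampled documents is treated as template text). The names `sample_documents`, `detect_template`, `extract_term_frequencies`, `toy_search`, `TOY_DB` and the `min_share` threshold are all hypothetical, and the toy in-memory database stands in for a real search interface.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lower-case alphabetic terms; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def sample_documents(search, seed_terms, required, rng=None):
    """Phase 1 (sketch): query with terms until `required` documents are sampled.

    `search(term)` is an assumed callable returning (doc_id, text) pairs.
    Query terms come first from the search interface page (`seed_terms`),
    then also from the documents sampled so far.
    """
    rng = rng or random.Random(0)
    pool = set(seed_terms)   # candidate query terms
    used = set()             # terms already submitted
    sampled = {}             # doc_id -> raw page text
    while len(sampled) < required:
        candidates = sorted(pool - used)
        if not candidates:
            break            # no terms left to try
        term = rng.choice(candidates)
        used.add(term)
        for doc_id, text in search(term):
            if doc_id not in sampled:
                sampled[doc_id] = text
                pool.update(tokenize(text))
            if len(sampled) >= required:
                break
    return sampled


def detect_template(docs, min_share=0.8):
    """Phase 2a (assumed heuristic): lines shared by most documents."""
    counts = Counter()
    for text in docs:
        lines = {ln.strip() for ln in text.splitlines() if ln.strip()}
        counts.update(lines)  # count each line once per document
    threshold = min_share * len(docs)
    return {line for line, n in counts.items() if n >= threshold}


def extract_term_frequencies(docs, template):
    """Phase 2b: term frequencies over non-template lines only."""
    freqs = Counter()
    for text in docs:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in template:
                freqs.update(tokenize(line))
    return freqs


# Toy stand-in for a Hidden Web database: every page shares a header and footer.
TOY_DB = {
    1: "MedDB Search Results\nheart disease treatment\nCopyright MedDB",
    2: "MedDB Search Results\nheart surgery outcomes\nCopyright MedDB",
    3: "MedDB Search Results\ndiabetes treatment trial\nCopyright MedDB",
}


def toy_search(term):
    """Return every toy page whose text contains `term`."""
    return [(i, t) for i, t in TOY_DB.items() if term in tokenize(t)]


sampled = sample_documents(toy_search, ["search", "results"], required=3)
template = detect_template(list(sampled.values()))
freqs = extract_term_frequencies(list(sampled.values()), template)
```

In this toy run the shared header and footer lines are classified as template text, so terms such as "copyright" do not contribute to the frequencies, while content terms such as "heart" do. A line-level heuristic is only one possible choice; template detection over page structure would be needed for real pages where template and content share a line.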
