A New Framework for Domain-Specific Hidden Web Crawling Based on Data Extraction Techniques

The World Wide Web continues to grow at an exponential rate, which makes exploiting all of its useful information a standing challenge. Search engines such as Google crawl and index a large amount of information, yet they ignore valuable data that represent 80% of the content on the Web. This portion of the Web, called the Hidden Web (HW), is "hidden" in databases behind search interfaces. In this paper, a framework for an HW crawler is proposed to crawl and extract hidden Web pages. Two unique features of our framework are 1) a classification phase that groups HW and Publicly Indexable Web (PIW) pages into distinct classes, so that our crawler performs well in both the domain-specific and random modes of crawling, and 2) the capability of dealing with both single-attribute and multi-attribute databases. Three novel algorithms are proposed in the framework: one for collecting Web pages, one for identifying relevant forms, and one for extracting labels. The effectiveness of the proposed algorithms is evaluated through experiments on real Web sites. The preliminary results are very promising; for instance, one of these algorithms proves to be accurate (over 99% precision and 100% recall).
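To make the distinction between PIW pages and HW entry points concrete, the following is a minimal sketch that labels a page by the search forms it contains. It is not the classification algorithm proposed in the framework, whose details are not given here. The names `FormFieldCounter` and `classify_page`, the three labels, and the heuristic of counting user-fillable fields are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class FormFieldCounter(HTMLParser):
    """Counts user-fillable fields in each <form> of a page (illustrative heuristic)."""

    # Input types that do not carry a query value typed or chosen by the user.
    NON_QUERY_TYPES = {"hidden", "submit", "reset", "button", "image"}

    def __init__(self):
        super().__init__()
        self.forms = []        # one field count per form, in document order
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.forms.append(0)
        elif self._in_form and tag in ("select", "textarea"):
            self.forms[-1] += 1
        elif self._in_form and tag == "input":
            # A missing type attribute defaults to a text field.
            input_type = (attrs.get("type") or "text").lower()
            if input_type not in self.NON_QUERY_TYPES:
                self.forms[-1] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def classify_page(html):
    """Label a page as PIW, or as an HW entry point with one or several attributes."""
    parser = FormFieldCounter()
    parser.feed(html)
    parser.close()
    # Use the form with the most fillable fields to decide the label.
    richest = max(parser.forms, default=0)
    if richest == 0:
        return "PIW"
    if richest == 1:
        return "HW-single-attribute"
    return "HW-multi-attribute"
```

Under this heuristic, a page with a single keyword box is treated as a single-attribute interface, and a page whose form has several fields is treated as multi-attribute. The sketch does not judge whether a form is relevant to a domain, so a login form, for example, would be mislabeled as an HW interface; telling such forms apart is the job of a separate form-identification step.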
