Nautilus: A Generic Framework for Crawling Deep Web

This paper presents Nautilus, a generic framework for crawling the deep Web. We provide an abstraction of the deep Web crawling process and a mechanism for integrating heterogeneous business modules. We propose a Federal Decentralized Architecture that combines the advantages of existing P2P networking architectures. We also present effective policies for scheduling crawling tasks. Experimental results show that these scheduling policies perform well on load balance and overall throughput.
