Crawling the Web Surface Databases

The World Wide Web is growing at a rapid rate; as of February 2007 its size was 29 billion pages. A web crawler is a computer program that browses the World Wide Web autonomously. One of the most important uses of a web crawler is to index web pages and keep the stored copies up to date, so that a search engine can serve end-user queries from them. Because the web is dynamic in nature, the pages held in a web repository must be refreshed constantly. In this paper, we put forward an efficient technique for refreshing a page stored in a web repository. We propose two methods, both of which detect changes by comparing page structure. The first method compares the structure of the pages using the tags that appear in them. The second method builds a document tree for each page and compares the structures of the trees.
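The two structural comparisons described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how tags are collected or how trees are compared, so the code below assumes that "page structure" means the sequence of opening tags (first method) and a nested tree of tag names (second method), with text and attributes ignored. The names `StructureParser`, `page_structure`, `tags_changed`, and `tree_changed` are hypothetical; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumption: a small, non-exhaustive set).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col", "wbr"}


class StructureParser(HTMLParser):
    """Collects a flat tag sequence and a nested tag tree, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []              # method 1: opening tags in document order
        self.root = ("#root", [])   # method 2: tree of (tag, children) nodes
        self._stack = [self.root]   # path from the root to the currently open element

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def page_structure(html):
    """Return (tag_sequence, document_tree) for one page."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.root


def tags_changed(old_html, new_html):
    """Method 1 (sketch): the page changed if its tag sequence differs."""
    return page_structure(old_html)[0] != page_structure(new_html)[0]


def tree_changed(old_html, new_html):
    """Method 2 (sketch): the page changed if its document tree differs."""
    return page_structure(old_html)[1] != page_structure(new_html)[1]


stored = "<html><body><div><p>a</p></div><p>b</p></body></html>"
fetched = "<html><body><div><p>a</p><p>b</p></div></body></html>"
print(tags_changed(stored, fetched), tree_changed(stored, fetched))
```

In this example the second `<p>` has moved inside the `<div>`. Under the assumptions above, the sequence of opening tags is identical in both pages, so the tag-based check reports no change, while the tree-based check detects the different nesting. Neither check reacts to edits that alter only the text of a page.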
