Paper information - Crawling the Web Surface Databases

Crawling the Web Surface Databases

The World Wide Web is growing rapidly. A web crawler is a computer program that browses the World Wide Web autonomously. As of February 2007, the web contained 29 billion pages. One of the most important uses of crawled web pages is indexing: the pages are stored and kept up to date so that a search engine can use them to serve end-user queries. Because the web is dynamic in nature, stored pages must be updated constantly. In this paper, we put forward an efficient technique for refreshing a page stored in a web repository. We propose two methods, both of which decide whether to refresh a page by comparing page structure. The first method compares the structure of pages using the tags they contain. The second method builds a document tree for each page and compares the structures of the trees.
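The abstract does not give the authors' algorithms, so the following is only a minimal sketch of the two ideas it names, assuming that "page structure" means the HTML tags alone and ignores text and attributes. All names here (`tag_sequence`, `document_tree`, `structure_changed`, `VOID_TAGS`) are hypothetical and chosen for illustration; the sketch uses Python's standard `html.parser` module and is not the paper's implementation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Records both a flat tag sequence and a nested (tag, children) tree."""

    def __init__(self):
        super().__init__()
        self.tags = []                 # method 1: tags in document order
        self.root = ("#root", [])      # method 2: document tree
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def _parse(html):
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser


def tag_sequence(html):
    """Method 1 (assumed form): the ordered list of opening tags."""
    return _parse(html).tags


def document_tree(html):
    """Method 2 (assumed form): nested (tag, children) tuples."""
    return _parse(html).root


def structure_changed(old_html, new_html, use_tree=False):
    """Return True if the stored and freshly fetched pages differ in structure."""
    extract = document_tree if use_tree else tag_sequence
    return extract(old_html) != extract(new_html)
```

Under these assumptions, a change that only edits text is not reported by either comparison, while the tree comparison can detect a change in nesting that leaves the flat tag sequence unchanged, for example `<div><p></p><span></span></div>` versus `<div><p><span></span></p></div>`.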

Sachin Sharma | Vidushi Singhal
