Parallel crawler architecture and web page change detection

In this paper, we propose a technique for parallel crawling of the web. The World Wide Web is growing at a phenomenal rate. It has enabled an explosion of useful online publishing, with the unfortunate side effect of information overload. As of February 2007, the size of the web stands at around 29 billion pages. One of the most important uses of crawling is to build indexes and keep stored copies of web pages up to date, which search engines later use to serve end-user queries. The paper presents an architecture modeled on the client-server architecture. It discusses a fresh approach to crawling the web in parallel using multiple machines, and also covers the routine issues of crawling. A major part of the web is dynamic, so changed web pages need to be refreshed constantly. We use a three-step algorithm for page refreshment, which checks whether the structure of a web page has changed, whether its text content has been altered, and whether an image has changed. For the server, we discuss a unique method for distributing URLs to client machines after determining their priority index. We also discuss a minor variation on prioritizing URLs by forward link count that takes frequency of update into account.
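As an illustration of the three-step check described above, the following is a minimal sketch and not the algorithm from the paper itself. It assumes that structure can be approximated by the sequence of opening tags, text by the visible text nodes, and images by their src attributes (the image bytes are not fetched or compared). The names PageSignature, signature, and detect_change are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class PageSignature(HTMLParser):
    """Collects three views of a page: tag sequence, text nodes, image sources."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "img":
            # Assumption: an image is identified by its src attribute only.
            self.images.append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def _digest(items):
    """Hash a list of strings into one fixed-size fingerprint."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


def signature(html):
    """Return one digest per step: structure, text, images."""
    parser = PageSignature()
    parser.feed(html)
    parser.close()
    return {
        "structure": _digest(parser.tags),
        "text": _digest(parser.text),
        "images": _digest(parser.images),
    }


def detect_change(old_html, new_html):
    """Run the three checks in order; return the first step that differs, else None."""
    old, new = signature(old_html), signature(new_html)
    for step in ("structure", "text", "images"):
        if old[step] != new[step]:
            return step
    return None
```

A crawler could store only the three digests of the last crawled version and call detect_change-style comparisons on the next visit, refreshing the stored copy only when one of the steps reports a difference.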
