Paper information - CRAYSE: design and implementation of efficient text search algorithm in a web crawler

CRAYSE is a SEarch WHIle CRAwl application intended to perform fast searching of text in web pages. A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. This process is also called spidering. Search engines use spidering as a means of providing up-to-date data. Most existing web crawlers archive the contents of the web starting from an input URL. Search engines index the results of web crawlers and then search that index when queried. As such, the searching is not performed while crawling. Hence such software cannot be put to general use by web browsers. Also, the existing search mechanism in web browsers searches only the current page and not recursively through all the links present in that page. To overcome these disadvantages, we propose in this paper to implement a web crawler that searches for a pattern efficiently and recursively through all the links, including PDF links, while crawling. CRAYSE can be used as general-purpose open source software by web browsers. It can also be used for offline searching. Further, applications that require selective archival of web pages (based on the presence of a keyword) can deploy CRAYSE for efficient search operations. This paper focuses on the design and implementation of CRAYSE and its demonstration through web applications.
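
The search-while-crawl idea described in the abstract can be illustrated with a minimal Python sketch. It is not the CRAYSE implementation: the abstract does not specify the traversal order, the matching algorithm, or how PDF links are handled, so the breadth-first queue, the plain substring test, the depth limit, and every name here (`LinkExtractor`, `search_while_crawl`, `fetch`, `PAGES`) are assumptions made for illustration. The hypothetical `fetch` callable stands in for real HTTP requests so the example runs offline, and PDF handling is omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def search_while_crawl(start_url, pattern, fetch, max_depth=2):
    """Return URLs whose text contains `pattern`, checked during the crawl.

    `fetch` is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page is unavailable.
    """
    hits = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        # Search each page as soon as it is fetched; no index is built.
        if pattern in " ".join(parser.text_parts):
            hits.append(url)
        # Follow links recursively up to the assumed depth limit.
        if depth < max_depth:
            for href in parser.links:
                target = urljoin(url, href)
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return hits


# Tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> home page',
    "http://example.test/a": '<a href="/">back</a> crawler notes',
    "http://example.test/b": "nothing relevant here",
}

result = search_while_crawl("http://example.test/", "crawler", PAGES.get)
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the same function could be pointed at a real HTTP client or at locally saved pages for offline searching.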

S. Selvakumar | V. Radhakishan | Yaser Farook
