Focused Web Crawler with Page Change Detection Policy

Focused crawlers aim to search only the subset of the web related to a specific topic. The major problem is how to retrieve the maximal set of relevant, high-quality pages, and focused crawling offers a potential solution. In this paper, we propose an architecture that concentrates on the page selection policy and the page revisit policy. A three-layer algorithm for page refreshment serves this purpose. The first layer decides page relevance using two methods. The second layer checks whether the structure of a web page has changed, whether its text content has been altered, and whether an image has changed. We also discuss a minor variation on the method of prioritizing URLs by forward link count, so that the priority also reflects how frequently a page is updated. Finally, the third layer updates the URL repository.
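The second-layer checks described above can be illustrated with a minimal sketch. It is not the paper's algorithm, whose details are not given in this abstract. It assumes that structure can be approximated by the sequence of opening tags, text by the visible character data, and images by their `src` attributes. The names `fingerprint` and `detect_changes` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _PageDigest(HTMLParser):
    """Collects the tag sequence, visible text, and image sources of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []      # opening tags in document order (structure proxy)
        self.text = []      # whitespace-normalized visible text fragments
        self.images = []    # src attribute of each <img> (image proxy)
        self._skip = 0      # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in ("script", "style"):
            self._skip += 1
        if tag == "img":
            self.images.append(dict(attrs).get("src") or "")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(" ".join(data.split()))


def _sha(items):
    """Hash a list of strings into one hex digest."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


def fingerprint(html):
    """Return one digest per aspect of the page (assumed representation)."""
    parser = _PageDigest()
    parser.feed(html)
    parser.close()
    return {
        "structure": _sha(parser.tags),
        "text": _sha(parser.text),
        "images": _sha(parser.images),
    }


def detect_changes(old_html, new_html):
    """Report which aspects differ between two versions of a page."""
    old, new = fingerprint(old_html), fingerprint(new_html)
    return {aspect: old[aspect] != new[aspect] for aspect in old}
```

A crawler could store the three digests alongside each URL and compare them on revisit. Because this sketch compares only image URLs, an image replaced under the same URL would go unnoticed; detecting that would require hashing the image bytes themselves.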
