An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler

The rapid growth of the World Wide Web (WWW) poses unprecedented scaling challenges for general-purpose crawlers. Crawlers are programs that traverse the Web and retrieve pages by following hyperlinks. Keeping search engine indices current through exhaustive crawling is rapidly becoming impossible as the Web grows. A focused crawler, as used by a special-purpose search engine, offers a potential solution: rather than exploring all regions of the Web, it selectively seeks out pages that are relevant to a pre-defined set of topics. In our proposed approach, we calculate a link score for each URL from two components. The first is the average relevancy score of the link's parent pages, on the assumption that a parent page is topically related to its child pages, since authors tend to place detailed information on child pages. The second is the division score, that is, the number of topic keywords that occur in the page division containing the link. We then compare the link score with a threshold value. If the link score is greater than or equal to the threshold, the link is considered relevant; otherwise, it is discarded. Among the relevant links, the focused crawler fetches the one with the highest link score first.
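The scoring and ordering steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the two components are combined, so the sketch assumes a plain sum. The `Link` type, the function names, the whitespace tokenization, and the sample threshold are all hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Link:
    url: str
    parent_relevancies: List[float]  # relevancy scores of pages that link to this URL
    division_text: str               # text of the page division that contains the link


def division_score(division_text: str, topic_keywords: List[str]) -> int:
    """Count how many topic keywords occur in the link's division."""
    words = set(division_text.lower().split())
    return sum(1 for kw in topic_keywords if kw.lower() in words)


def link_score(link: Link, topic_keywords: List[str]) -> float:
    """Average parent relevancy plus division score (the sum is an assumption)."""
    parents = link.parent_relevancies
    avg_parent = sum(parents) / len(parents) if parents else 0.0
    return avg_parent + division_score(link.division_text, topic_keywords)


def prioritize(links: List[Link], topic_keywords: List[str],
               threshold: float) -> List[Tuple[str, float]]:
    """Discard links below the threshold; return the rest, highest score first."""
    heap = []
    for index, link in enumerate(links):
        score = link_score(link, topic_keywords)
        if score >= threshold:
            # heapq is a min-heap, so push the negated score; index breaks ties.
            heapq.heappush(heap, (-score, index, link.url))
    ordered = []
    while heap:
        neg_score, _, url = heapq.heappop(heap)
        ordered.append((url, -neg_score))
    return ordered


if __name__ == "__main__":
    sample = [
        Link("http://example.org/a", [0.5, 0.25], "focused crawler topic"),
        Link("http://example.org/b", [0.125], "weather news"),
        Link("http://example.org/c", [0.75], "web crawler"),
    ]
    for url, score in prioritize(sample, ["crawler", "focused"], threshold=1.0):
        print(url, score)
```

A priority queue is used here so that the crawler always dequeues the highest-scoring relevant link next; a real crawler would keep pushing newly discovered links onto the same queue as pages are fetched.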
