UDBFC: An effective focused crawling approach based on URL Distance calculation

Vertical search engines use focused crawlers as their key component and develop specific algorithms to select web pages relevant to a pre-defined set of topics. Building an effective semantic pattern for specific topics is therefore extremely important to such search engines. Crawlers are software programs that traverse the internet and retrieve web pages by following hyperlinks. Here we propose UDBFC (URL Distance Based Focused Crawler), an algorithm based on a double crawler framework consisting of an experimental crawler and a focused crawler. The main idea of UDBFC is to measure the relevancy between a seed page and its child pages using the vector space model. Seed pages are the search results common to the three most popular search engines: Google, Yahoo, and MSN Search. Child page links are the out-links of a seed page, extracted from it by a link extractor tool. The experimental crawler fetches the seed page and its child pages and calculates the relevancy between the seed page and each of its child pages. After the relevancy calculation, it groups the URLs based on their relevancy scores. The focused crawler then fetches topic-specific pages from the internet based on a distance score calculated between the grouped URLs and each URL that is to be fetched.
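The relevancy and grouping steps can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity as the vector space model, and it splits child URLs into three groups using two thresholds. The abstract does not specify the term weighting, the number of groups, or the cut-off values, so the function names, the group labels, the thresholds (0.5 and 0.2), and the example URLs are all hypothetical. The URL distance score itself is not defined in the abstract and is not sketched here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (raw term frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_relevancy(seed_text, child_text):
    """Cosine similarity between the term vectors of two page texts."""
    a, b = term_vector(seed_text), term_vector(child_text)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_by_relevancy(seed_text, children, thresholds=(0.5, 0.2)):
    """Group child URLs by their relevancy to the seed page.

    `children` maps a child URL to its page text. The two thresholds
    are illustrative assumptions, not values taken from the paper.
    """
    high, low = thresholds
    groups = {"high": [], "medium": [], "low": []}
    for url, text in children.items():
        score = cosine_relevancy(seed_text, text)
        if score >= high:
            groups["high"].append(url)
        elif score >= low:
            groups["medium"].append(url)
        else:
            groups["low"].append(url)
    return groups


# Hypothetical seed text and child pages, for illustration only.
seed = "focused crawler topic"
children = {
    "http://example.com/a": "focused crawler topic",
    "http://example.com/b": "focused web search page",
    "http://example.com/c": "cooking recipes",
}
groups = group_by_relevancy(seed, children)
```

In this sketch, a child page that shares no terms with the seed page scores 0 and falls into the lowest group; a real crawler would more likely use a weighted scheme such as TF-IDF and strip HTML markup before building the vectors.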
