An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

A focused crawler is topic-specific and aims to selectively collect web pages that are relevant to a given topic from the Internet. However, the performance of current focused crawlers is easily degraded by the surrounding context of web pages and by web pages that cover multiple topics. During crawling, a highly relevant region of a page may be ignored because the overall relevance of that page is low, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance of the URLs on a web page to the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms the one using TFIDF, and that our focused crawler is superior to other focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies in terms of harvest rate and target recall. These results indicate that the proposed methods are effective for focused crawling.
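As a rough illustration of the general idea of ranking links by combined evidence, the following minimal sketch scores each link by mixing the relevance of its enclosing content block with the relevance of its anchor text, then serves links from a priority queue. It is not the paper's ITFIDF, LPE, or JFE: the abstract does not give their formulas, so the raw term-frequency weighting, the linear mix, and the names `term_vector`, `link_priority`, `block_weight`, and `Frontier` are all assumptions made for illustration. Harvest rate is computed here under the usual definition, the fraction of crawled pages that are relevant.

```python
import heapq
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector (assumed stand-in for a TF-IDF-style weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_priority(topic, block_text, anchor_text, block_weight=0.5):
    """Hypothetical joint score: weighted mix of block and anchor relevance."""
    block_score = cosine(topic, term_vector(block_text))
    anchor_score = cosine(topic, term_vector(anchor_text))
    return block_weight * block_score + (1.0 - block_weight) * anchor_score


class Frontier:
    """Crawl frontier that always returns the highest-priority unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]


def harvest_rate(relevant_crawled, total_crawled):
    """Fraction of crawled pages judged relevant to the topic."""
    return relevant_crawled / total_crawled if total_crawled else 0.0


# Usage with made-up text and placeholder URLs.
topic = term_vector("focused crawler web crawling")
frontier = Frontier()
frontier.push(
    "http://example.com/a",
    link_priority(topic, "a focused crawler collects topical web pages", "crawler tutorial"),
)
frontier.push(
    "http://example.com/b",
    link_priority(topic, "recipes for dinner", "pasta"),
)
first = frontier.pop()  # the link whose block and anchor text match the topic
```

Scoring the enclosing block rather than the whole page is what lets a relevant region on an otherwise off-topic page still raise the priority of its links; the 0.5 default for `block_weight` is arbitrary and would need tuning.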
