Paper Information - Towards a Keyword-Focused Web Crawler

Towards a Keyword-Focused Web Crawler

This paper concerns predicting the content of textual web documents from features extracted from the web pages that link to them. The approach may be applied in an intelligent, keyword-focused web crawler. Experiments on publicly available real data from the Open Directory Project, using several classification models, are promising and indicate that the studied approach is potentially useful for automatically obtaining keyword-rich web document collections.
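As a rough illustration of the idea in the abstract, the sketch below trains a small classifier on features taken only from the linking page and uses it to rank unvisited links by the estimated probability that the target page contains a keyword. It is not the authors' method: the two binary features, the choice of a Bernoulli naive Bayes model, the toy training data, and all names (`extract_features`, `BernoulliNB`, `rank_links`) are assumptions made for this example.

```python
import math


def extract_features(keyword, source_text, anchor_text):
    """Hypothetical binary features computed from the linking page only."""
    kw = keyword.lower()
    return (
        kw in anchor_text.lower(),  # keyword appears in the link's anchor text
        kw in source_text.lower(),  # keyword appears somewhere on the linking page
    )


class BernoulliNB:
    """Tiny Bernoulli naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, X, y):
        n_features = len(X[0])
        self.total = len(y)
        self.class_count = {False: 0, True: 0}
        self.feature_count = {False: [0] * n_features, True: [0] * n_features}
        for features, label in zip(X, y):
            self.class_count[label] += 1
            for i, value in enumerate(features):
                self.feature_count[label][i] += int(value)
        return self

    def predict_proba(self, features):
        """Estimate P(target page contains the keyword | link features)."""
        log_scores = {}
        for c in (False, True):
            # Smoothed class prior.
            log_p = math.log((self.class_count[c] + 1) / (self.total + 2))
            for i, value in enumerate(features):
                # Smoothed probability that feature i is "on" within class c.
                p_on = (self.feature_count[c][i] + 1) / (self.class_count[c] + 2)
                log_p += math.log(p_on if value else 1.0 - p_on)
            log_scores[c] = log_p
        # Normalise in a numerically stable way.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


def rank_links(model, keyword, candidates):
    """Order (url, source_text, anchor_text) candidates by predicted probability."""
    scored = [
        (model.predict_proba(extract_features(keyword, src, anchor)), url)
        for url, src, anchor in candidates
    ]
    return sorted(scored, reverse=True)


# Toy, made-up training set: (features of the link, did the target contain the keyword?)
train = [
    ((True, True), True),
    ((True, True), True),
    ((False, True), True),
    ((False, True), False),
    ((False, False), False),
    ((False, False), False),
]
model = BernoulliNB().fit([f for f, _ in train], [label for _, label in train])

# Hypothetical frontier of unvisited links.
frontier = [
    ("http://example.org/a", "Notes on web crawler design", "crawler tutorial"),
    ("http://example.org/b", "Cooking recipes", "pasta"),
]
for prob, url in rank_links(model, "crawler", frontier):
    print(f"{prob:.3f}  {url}")
```

A keyword-focused crawler could use such scores to decide which links to fetch first; in practice the feature set and the classifier would need to be chosen and evaluated on real data, such as the Open Directory Project data mentioned above.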

Marcin Sydow | Tomasz Kusmierczyk
