Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of domain-specific search engines. A selective, directed web spider can be much more efficient than a spider that gathers new pages indiscriminately. This paper argues that the creation of efficient web spiders is best framed and solved by reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give no immediate benefit, but do give benefit in the future. Topic-specific spidering fits into the reinforcement learning framework because a hyperlink must be valued by the future reward it leads to, not only by the page it points at directly. Experimental results on large collections of real web data show that a reinforcement learning spider finds relevant pages three times faster than a competing spider based on breadth-first search. The results also show that our spider is not yet taking full advantage of future utility because of inaccuracies in our approximation for mapping hyperlinks to their expected future utility. Thus we believe that improving the accuracy of this mapping will increase performance even further, and we present ideas for doing so.
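As a minimal sketch of what valuing a hyperlink by future reward can mean, the following computes a discounted sum of rewards along the pages a link leads to. The discount factor of 0.5, the function name, and the two reward sequences are illustrative assumptions, not values or code from the paper.

```python
# Illustrative only: GAMMA and the reward sequences below are assumptions.
GAMMA = 0.5  # assumed discount factor; later rewards count for less


def discounted_return(rewards, gamma=GAMMA):
    """Return sum over t of gamma**t * rewards[t] for a sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical link A: the target page is off-topic (reward 0),
# but it links to two on-topic pages (reward 1 each).
link_a = [0, 1, 1]

# Hypothetical link B: the target page is off-topic and a dead end.
link_b = [0]

value_a = discounted_return(link_a)  # 0 + 0.5*1 + 0.25*1
value_b = discounted_return(link_b)

# Both links have the same immediate reward, yet A is worth more
# once future reward is taken into account.
print(value_a, value_b)
```

Both hypothetical links give no immediate benefit, so a spider that scores only the next page cannot tell them apart; the discounted sum separates them.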