Efficient Web Spidering with Reinforcement Learning

Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of domain-specific search engines. A selective, directed web spider can be much more efficient than a spider that gathers new pages indiscriminately. This paper argues that the creation of efficient web spiders is best framed and solved by reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give no immediate benefit, but do give benefit in the future. Topic-specific spidering fits into the reinforcement learning framework because it requires valuing hyperlinks by the future reward they lead to. Experimental results on large collections of real web data show that a reinforcement learning spider finds relevant pages three times faster than a competing spider based on breadth-first search. The results also show that our spider is not yet taking full advantage of future utility because of inaccuracies in our approximation for mapping hyperlinks to their expected future utility. Thus we believe that improving the accuracy of this mapping will increase performance even further, and we present ideas for doing so.
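
As a rough illustration of why future reward matters when valuing a hyperlink, the sketch below scores two hypothetical links by a discounted sum of the rewards that follow them. The discount factor `gamma`, the function name `discounted_return`, and the two reward sequences are assumptions made for this example only; they are not taken from the experiments described above.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, discounting each by gamma per step into the future."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Hypothetical reward sequences observed after following each link:
# 1 means a relevant page was fetched at that step, 0 means it was not.
link_a = [1, 0, 0, 0]  # immediate hit, then nothing
link_b = [0, 1, 1, 1]  # no immediate hit, but leads to a cluster of hits

value_a = discounted_return(link_a)
value_b = discounted_return(link_b)

# A spider that only looks at immediate reward would prefer link_a;
# one that values future reward prefers link_b.
best = "link_b" if value_b > value_a else "link_a"
print(best, round(value_a, 3), round(value_b, 3))
```

Under these assumed values, the link with no immediate benefit receives the higher score, which is the kind of trade-off that a spider following links in breadth-first order does not weigh.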