Research on a topic crawling strategy based on a genetic algorithm

To address the topic drift problem in topic crawling, this paper presents a topic crawling strategy for web crawlers. Building on a genetic algorithm, the strategy incorporates the PageRank algorithm and the relevance between a web page and the topic, redefines the fitness function, and adjusts the values of the related calculation parameters. In this way, superior individuals are selected first and topic drift is reduced. Compared with previous strategies based on genetic algorithms, the number of crawled web pages relevant to the crawling topic increases by more than 5%.
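The idea of a fitness function that blends PageRank with page-topic relevance can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the linear weighting with `alpha`, the use of cosine similarity over raw term counts as the relevance measure, the normalization of PageRank by the maximum in the candidate set, and the function and field names (`cosine_relevance`, `fitness`, `select_fittest`, `url`, `text`, `pagerank`).

```python
import math
from collections import Counter


def cosine_relevance(page_text, topic_terms):
    """Cosine similarity between a page's term counts and the topic terms.

    Assumption: relevance is measured on raw term counts; the paper may
    use a different measure (for example TF-IDF weights).
    """
    page_counts = Counter(page_text.lower().split())
    topic_counts = Counter(term.lower() for term in topic_terms)
    dot = sum(page_counts[term] * weight for term, weight in topic_counts.items())
    norm_page = math.sqrt(sum(c * c for c in page_counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_counts.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def fitness(pagerank, relevance, alpha=0.3):
    """Hypothetical fitness: weighted sum of link authority and topic relevance.

    alpha is an illustrative tuning parameter, not a value from the paper.
    """
    return alpha * pagerank + (1 - alpha) * relevance


def select_fittest(candidates, topic_terms, k=2, alpha=0.3):
    """Return the URLs of the k candidates with the highest fitness."""
    # Normalize PageRank to [0, 1] so both terms share a scale.
    max_pr = max((c["pagerank"] for c in candidates), default=0.0) or 1.0
    scored = []
    for c in candidates:
        relevance = cosine_relevance(c["text"], topic_terms)
        score = fitness(c["pagerank"] / max_pr, relevance, alpha)
        scored.append((score, c["url"]))
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]


if __name__ == "__main__":
    # Made-up candidates; URLs and PageRank values are placeholders.
    candidates = [
        {"url": "page-a", "text": "genetic algorithm crawler", "pagerank": 0.2},
        {"url": "page-b", "text": "cooking recipes pasta", "pagerank": 1.0},
        {"url": "page-c", "text": "genetic crawler topic search", "pagerank": 0.5},
    ]
    print(select_fittest(candidates, ["genetic", "crawler"]))
```

With a weighting of this kind, a page with high PageRank but no topical overlap scores below on-topic pages with modest PageRank, which is one plausible way selection pressure could reduce topic drift. Crossover and mutation, which a full genetic algorithm would also need, are left out of this sketch.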