A Topic-specific Intelligent Web Crawler System

This paper introduces a topic-specific intelligent Web crawler system and its crawling algorithm, which is based on Web content mining and Web structure mining. The algorithm takes full advantage of the characteristics of neural networks, which can conveniently simulate the network topology and support parallel computation. The paper also introduces reinforcement learning to judge the relevance of a crawled page to the topic. When calculating this relevance, the system does not consider the whole content of the Web page. Instead, it extracts the important tags from the page's HTML markup and analyzes the content and structure of the page through them to judge how relevant the crawled page is to the topic. This substantially improves the efficiency and accuracy of information collection.
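
As an illustration of the tag-based relevance idea only, the following minimal Python sketch scores a page against a topic using the text found inside a few HTML tags and ignores everything else. The choice of tags, the weights in `TAG_WEIGHTS`, the keyword-matching score, and the names `ImportantTagParser` and `relevance` are all assumptions made for this example; the abstract does not specify them, and the sketch does not include the neural network or reinforcement learning components.

```python
from html.parser import HTMLParser
import re

# Hypothetical tag weights: the abstract does not say which tags count as
# "important" or how they are weighted. These values are assumptions.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.0, "b": 1.0, "strong": 1.0}


class ImportantTagParser(HTMLParser):
    """Collects (weight, word) pairs from text inside the weighted tags only."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # weighted tags that are currently open
        self.weighted_words = []   # (weight, word) pairs found so far

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Remove the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if not self.open_tags:
            return  # text outside the weighted tags is ignored
        # Nested weighted tags: use the largest weight among the open ones.
        weight = max(TAG_WEIGHTS[t] for t in self.open_tags)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.weighted_words.append((weight, word))


def relevance(html, topic_keywords):
    """Return the weighted share of important-tag words that match the topic.

    The result lies in [0, 1]; it is 0.0 when no weighted tag contains text.
    """
    parser = ImportantTagParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    total = sum(weight for weight, _ in parser.weighted_words)
    if total == 0:
        return 0.0
    topic = {keyword.lower() for keyword in topic_keywords}
    hits = sum(weight for weight, word in parser.weighted_words if word in topic)
    return hits / total
```

A crawler could call `relevance(page_html, ["crawler", "web"])` on each fetched page and follow links only from pages whose score exceeds a chosen threshold. Because only a handful of tags are parsed, the score is cheap to compute compared with analyzing the full page text, which is the efficiency argument the abstract makes.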