Web Page's Blocks Based Topical Crawler

Link context has been widely used in information retrieval and classification. In topical (vertical) crawlers, link context is used to predict whether a link is relevant to the target topic. The context of a link is usually taken to be its anchor text, the full text of the containing web page, or the words within a fixed-size window around the link. Each choice has a drawback: the full page text often covers too many themes, the anchor text is too brief, and the appropriate window size is hard to determine. In this paper, we propose determining the scope of link context by partitioning the web page into blocks, on the premise that links in the same block are more closely related to one another. A corner classification neural network is used to represent and filter topics. Our experiments show that crawlers using block-based link context achieve higher accuracy, and that the corner classification neural network is suitable for representing and filtering topics.
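
The idea of block-based link context can be illustrated with a minimal sketch. It is not the segmentation method used in this paper, which the abstract does not specify. It assumes, purely for illustration, that a block is the innermost enclosing element with one of a few block-level tags, and that the context of a link is the text directly inside that block. The names `BLOCK_TAGS`, `BlockLinkContext`, and `block_link_contexts` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a "block" is the innermost enclosing element with one of these tags.
BLOCK_TAGS = {"div", "td", "li", "section", "article"}


class BlockLinkContext(HTMLParser):
    """Collect, for each link, the text of its innermost enclosing block."""

    def __init__(self):
        super().__init__()
        # Stack of open blocks; the first entry is a pseudo-block for the page root.
        self.stack = [{"tag": None, "text": [], "links": []}]
        self.results = []  # list of (href, context_text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text": [], "links": []})
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.stack[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS:
            return
        if tag not in [block["tag"] for block in self.stack]:
            return  # stray closing tag with no matching open block
        # Close blocks until the matching one is closed (tolerates missing end tags).
        while True:
            block = self.stack.pop()
            self._flush(block)
            if block["tag"] == tag:
                break

    def handle_data(self, data):
        # Text (including anchor text) belongs to the innermost open block.
        self.stack[-1]["text"].extend(data.split())

    def _flush(self, block):
        context = " ".join(block["text"])
        for href in block["links"]:
            self.results.append((href, context))

    def close(self):
        super().close()
        # Flush blocks left open at end of input, including the root pseudo-block.
        while self.stack:
            self._flush(self.stack.pop())


def block_link_contexts(html):
    """Return (href, block_text) pairs for every link in the HTML string."""
    parser = BlockLinkContext()
    parser.feed(html)
    parser.close()
    return parser.results
```

With this scheme, two links that sit in different blocks of the same page receive different contexts, unlike the whole-page approach, and the scope adapts to the page layout rather than to a fixed window size. In a crawler, each context string would then be passed to the topic filter to decide whether the link should be followed.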