Adaptive Web Crawling Through Structure-Based Link Classification

Generic web crawling approaches cannot distinguish among page types and cannot target the content-rich areas of a website. We study the problem of efficient, unsupervised crawling of content-rich webpages. We propose ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that uses the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases. In the learning phase, it constructs a dynamic site map, limiting the number of URLs retrieved, and learns a traversal strategy based on the importance of navigation patterns, selecting those that lead to valuable content. In the intensive crawling phase, ACEBot performs massive downloading by following the chosen navigation patterns. Experiments over a large dataset illustrate the effectiveness of our system.
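
The two-phase idea can be illustrated with a minimal Python sketch. It is not ACEBot's actual implementation: the in-memory SITE dictionary, the use of a link's DOM path as its navigation pattern, the scoring of a pattern by the mean content size of the pages it leads to, and the function names learn_patterns and intensive_crawl are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical toy site: url -> (content_size, [(dom_path_of_link, target_url), ...]).
# A real crawler would fetch and parse pages instead of reading this dictionary.
SITE = {
    "/": (10, [("nav/a", "/about"),
               ("ul.posts/li/a", "/post/1"),
               ("ul.posts/li/a", "/post/2")]),
    "/about": (20, [("nav/a", "/")]),
    "/post/1": (500, [("div.next/a", "/post/2"), ("nav/a", "/about")]),
    "/post/2": (450, [("div.next/a", "/post/3"), ("nav/a", "/about")]),
    "/post/3": (480, [("nav/a", "/")]),
}


def learn_patterns(start, max_pages, top_k):
    """Learning phase: fetch at most max_pages pages breadth-first and
    rank each link pattern by the mean content size of the pages it reached."""
    seen = {start}
    queue = deque([(start, None)])  # (url, pattern the url was reached by)
    sizes = defaultdict(list)
    fetched = 0
    while queue and fetched < max_pages:
        url, via = queue.popleft()
        size, links = SITE[url]
        fetched += 1
        if via is not None:
            sizes[via].append(size)
        for pattern, target in links:
            if target not in seen:
                seen.add(target)
                queue.append((target, pattern))
    ranked = sorted(sizes, key=lambda p: sum(sizes[p]) / len(sizes[p]),
                    reverse=True)
    return ranked[:top_k]


def intensive_crawl(start, patterns):
    """Intensive phase: crawl breadth-first, following only the chosen patterns."""
    allowed = set(patterns)
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for pattern, target in SITE[url][1]:
            if pattern in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


patterns = learn_patterns("/", max_pages=5, top_k=2)
pages = intensive_crawl("/", patterns)
```

On this toy site the learning phase keeps the two patterns that lead to the large post pages and drops the navigation-bar pattern, so the intensive phase never visits /about. The max_pages bound plays the role of limiting the number of URLs retrieved while the site map is built.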