Focused Web Crawling Based on Incremental Learning

A focused web crawler collects web pages relevant to topics of interest from the Internet. Most researchers have studied strategies that rely on an initial model to gather as many relevant web pages as possible during focused crawling. In a previous study, we proposed a model named Concept Context Graph (CCG), based on Formal Concept Analysis (FCA). However, web information changes continually over time, and an initial model that represents outdated information cannot correctly reflect the user's topics of interest. In this paper, we update the CCG through incremental learning to retrieve more topic-relevant web pages. We extract Incremental Concepts (ICs) from newly visited pages and insert them into the CCG according to the semantic similarity between the core concept and each incremental concept. In addition, we delete concepts from the CCG according to a given threshold b. Finally, our experiments show that our method improves the results of focused web crawling.
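The update cycle described above (insert incremental concepts by their similarity to the core concept, then delete concepts according to the threshold b) can be illustrated with a minimal sketch. Several points here are assumptions for illustration and are not specified in this abstract: concepts are reduced to plain term sets, Jaccard overlap stands in for the semantic similarity measure, deletion is taken to mean removing concepts whose similarity to the core concept falls below b, and the names `Concept`, `ConceptContextGraph`, `insert`, and `prune` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept reduced to its set of terms (a simplification of an FCA concept)."""
    terms: frozenset


def jaccard(a: Concept, b: Concept) -> float:
    """Stand-in similarity (assumption): overlap of the two term sets, in [0, 1]."""
    union = a.terms | b.terms
    return len(a.terms & b.terms) / len(union) if union else 0.0


@dataclass
class ConceptContextGraph:
    """Toy CCG: stores each concept with its similarity to the core concept."""
    core: Concept
    concepts: dict = field(default_factory=dict)  # Concept -> similarity to core

    def insert(self, ic: Concept) -> None:
        # Place the incremental concept according to its similarity to the core.
        self.concepts[ic] = jaccard(self.core, ic)

    def prune(self, b: float) -> list:
        # Assumed deletion rule: drop concepts whose similarity is below b.
        removed = [c for c, sim in self.concepts.items() if sim < b]
        for c in removed:
            del self.concepts[c]
        return removed


core = Concept(frozenset({"web", "crawler", "topic"}))
ccg = ConceptContextGraph(core)
ccg.insert(Concept(frozenset({"web", "crawler", "focused"})))  # shares 2 of 4 terms
ccg.insert(Concept(frozenset({"cooking", "recipe"})))          # shares no terms
removed = ccg.prune(b=0.3)
```

In this toy run the first incremental concept stays in the graph and the second is pruned. A real system would substitute the paper's own similarity measure and graph structure for these simplified stand-ins.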