A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree

As the Internet grows exponentially, web data mining has become the main method of finding relevant information. Because the number of web sites and documents is growing even faster and site contents are updated more and more often, focused web crawlers are becoming increasingly popular. The literature has studied in depth how to order unvisited URLs. Existing approaches calculate an unvisited URL's prediction score from its ancestor pages, but they give all URLs on one web page the same score; in other words, they assume that a web page carries only one topic. We find, however, that different parts of a web page carry their own topic information while all supporting one or several broader topics, so URLs in different paragraphs should be given different scores based on the hierarchical relationship among those paragraphs. In this paper, we parse every web page into a Dom-Tree, propose rules on the tree for extracting the relationships among different paragraphs, and then present a new topic-specific web crawler that calculates an unvisited URL's prediction score from the web page hierarchy and the text semantic similarity. We consider three factors. First, we calculate text similarity using the vector space model (VSM), which treats a query or paragraph as a vector whose terms are independent. Second, because the order of terms in a text paragraph also carries information that VSM ignores, we use an edit distance over term sequences to capture it. Third, different paragraphs in a web page are related according to their hierarchy in the Dom-Tree. Finally, we combine the three factors in our crawler's strategy and present our model.
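To make the three factors concrete, the following is a minimal Python sketch of how they might be combined. It is an illustration under stated assumptions, not the model defined in this paper: the function names, the linear mix controlled by `alpha`, and the per-level `decay` weight are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a_terms, b_terms):
    """VSM factor: cosine of raw term-frequency vectors (terms independent)."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def term_edit_distance(a_terms, b_terms):
    """Levenshtein distance computed over whole terms, not characters."""
    prev = list(range(len(b_terms) + 1))
    for i, ta in enumerate(a_terms, start=1):
        curr = [i]
        for j, tb in enumerate(b_terms, start=1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sequence_similarity(a_terms, b_terms):
    """Sequence factor: edit distance normalized into the range [0, 1]."""
    longest = max(len(a_terms), len(b_terms))
    if longest == 0:
        return 0.0
    return 1.0 - term_edit_distance(a_terms, b_terms) / longest


def text_similarity(query_terms, block_terms, alpha=0.7):
    """Assumed linear mix of the VSM and sequence factors (alpha is hypothetical)."""
    return (alpha * cosine_similarity(query_terms, block_terms)
            + (1 - alpha) * sequence_similarity(query_terms, block_terms))


def url_score(query_terms, blocks, alpha=0.7, decay=0.5):
    """Hierarchy factor: weight each enclosing block by its tree level.

    blocks is a list of (terms, level) pairs. Level 0 is the paragraph that
    contains the link, level 1 its parent block, and so on. The geometric
    decay per level is an assumption made for this sketch.
    """
    weighted = total = 0.0
    for terms, level in blocks:
        weight = decay ** level
        weighted += weight * text_similarity(query_terms, terms, alpha)
        total += weight
    return weighted / total if total else 0.0
```

For example, the term lists `["web", "crawler", "topic"]` and `["topic", "web", "crawler"]` contain the same terms, so their cosine similarity is 1 up to floating-point rounding, while their term-level edit distance is 2. This is the kind of ordering difference that the second factor is meant to expose. Likewise, two links under the same page heading receive different values from `url_score` when the paragraphs that contain them differ in relevance to the query.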