Paper Information - A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree

As the Internet grows exponentially, web data mining has become the main method of finding relevant information. Because the number of web sites and documents is growing even faster and site contents are updated more and more often, focused web crawlers are becoming increasingly popular. The literature has studied in depth how to order unvisited URLs: existing methods calculate a prediction score from an unvisited URL's ancestor pages, but all URLs in one web page are given the same score. In other words, these methods assume that a web page carries only one topic. We find, however, that the different parts of a web page have their own topic information while together supporting one or several larger topics, so the URLs in different paragraphs should be given different scores based on the hierarchical relationship among them. In this paper, we parse every web page as a Dom-Tree, propose rules on the tree for extracting the relationships among different paragraphs, and then present a new topic-specific web crawler that calculates an unvisited URL's prediction score based on the web page hierarchy and text semantic similarity. We consider three factors. First, we calculate text similarity using the vector space model (VSM), which treats the query or paragraph as a vector whose terms are independent. Second, because the order of terms in a text paragraph also carries information that VSM ignores, we use an edit distance based on term sequences to compensate. Third, the different paragraphs in a web page are related to one another according to their hierarchy in the Dom-Tree. Finally, we combine the three factors in our crawler's strategy and present our model.
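The three factors named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract gives no formulas, so the function names, the linear weighting `alpha`, and the depth-based `decay` used to stand in for the Dom-Tree hierarchy are all assumptions made for illustration. Only the two building blocks, cosine similarity over term vectors (VSM) and Levenshtein edit distance applied to term sequences, come from the text.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, paragraph_terms):
    """VSM similarity: each text is a term-frequency vector with independent terms."""
    q, p = Counter(query_terms), Counter(paragraph_terms)
    dot = sum(q[t] * p[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0


def term_edit_distance(a, b):
    """Levenshtein distance over sequences of terms instead of characters."""
    prev = list(range(len(b) + 1))
    for i, term_a in enumerate(a, start=1):
        curr = [i]
        for j, term_b in enumerate(b, start=1):
            cost = 0 if term_a == term_b else 1
            curr.append(min(prev[j] + 1,          # delete term_a
                            curr[j - 1] + 1,      # insert term_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Normalise the edit distance into [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 - term_edit_distance(a, b) / longest if longest else 1.0


def paragraph_score(query_terms, paragraph_terms, depth, alpha=0.5, decay=0.8):
    """Hypothetical combination of the three factors (an assumption, not the paper's formula).

    depth: how many levels the paragraph sits below its topic node in the Dom-Tree.
    alpha: assumed weight between VSM similarity and sequence similarity.
    decay: assumed per-level discount representing the page hierarchy.
    """
    vsm = cosine_similarity(query_terms, paragraph_terms)
    seq = sequence_similarity(query_terms, paragraph_terms)
    return (alpha * vsm + (1 - alpha) * seq) * decay ** depth


if __name__ == "__main__":
    query = ["topic", "specific", "web", "crawler"]
    paragraph = ["web", "crawler", "topic", "specific"]
    # Same bag of terms, different order: VSM cannot tell them apart,
    # while the term-sequence edit distance can.
    print(cosine_similarity(query, paragraph))
    print(sequence_similarity(query, paragraph))
    print(paragraph_score(query, paragraph, depth=1))
```

In this sketch, URLs found in a paragraph would inherit that paragraph's score, so links in different parts of the same page can be ranked differently, which is the behaviour the abstract argues for.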

Yuekui Yang | Yajun Du | Yufeng Hai | Zhaoqiong Gao
