Paper Information - Web Content Extraction Using Clustering with Web Structure

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering by web structure. The process is divided into two steps. In the first step, web pages collected from different websites are clustered, using a page similarity measure based on weighted dynamic programming. First, each web page is parsed into a DOM tree; second, a weight is assigned to every node according to the position of the node, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated according to the given formula. When the first step is finished, web pages with similar structure have been grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed; what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
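
The two steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's method: the abstract does not give the weighting formula or the similarity formula, so the sketch substitutes a hypothetical depth-based weight and a weighted longest-common-subsequence score computed by dynamic programming. The greedy clustering, the `threshold` value, and all function names are likewise assumptions. Text inside `script` or `style` elements is not filtered out here.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PageParser(HTMLParser):
    """Collects (tag, depth) pairs in document order, plus text blocks."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []   # [(tag, depth), ...] in pre-order
        self.texts = []   # stripped, non-empty text blocks

    def handle_starttag(self, tag, attrs):
        self.nodes.append((tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def parse(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def weighted_nodes(nodes):
    """Hypothetical weight (an assumption, not the paper's formula):
    shallower nodes and less crowded levels count more."""
    if not nodes:
        return []
    max_depth = max(d for _, d in nodes)
    per_depth = Counter(d for _, d in nodes)
    return [(tag, d, (max_depth - d + 1) / ((max_depth + 1) * per_depth[d]))
            for tag, d in nodes]


def similarity(a, b):
    """Weighted LCS over pre-order (tag, depth) sequences, scaled to [0, 1]."""
    wa, wb = weighted_nodes(a), weighted_nodes(b)
    total = sum(w for *_, w in wa) + sum(w for *_, w in wb)
    if total == 0:
        return 1.0
    prev = [0.0] * (len(wb) + 1)
    for tag_a, d_a, w_a in wa:
        cur = [0.0]
        for j, (tag_b, d_b, w_b) in enumerate(wb, start=1):
            best = max(prev[j], cur[j - 1])
            if (tag_a, d_a) == (tag_b, d_b):
                best = max(best, prev[j - 1] + w_a + w_b)
            cur.append(best)
        prev = cur
    return prev[-1] / total


def cluster(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose first member
    is at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []   # each cluster is a list of indices into `pages`
    for i, page in enumerate(pages):
        for members in clusters:
            if similarity(pages[members[0]].nodes, page.nodes) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def extract_content(pages, members):
    """Drop text blocks shared by several pages of a cluster; keep the rest."""
    seen_in = Counter()
    for i in members:
        seen_in.update(set(pages[i].texts))
    return {i: [t for t in pages[i].texts if seen_in[t] == 1] for i in members}
```

To use the sketch, call `parse` on each HTML string, pass the resulting list to `cluster`, and then call `extract_content` once per cluster. Comparing exact text blocks is the simplest reading of "the same parts of pages will be removed"; comparing DOM subtrees instead would be closer to a structural approach, but the abstract does not say which the authors use.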
