Paper information - Filtering noise in Web pages based on parsing tree


Abstract This paper proposes a novel method for filtering noise from web pages using a parse tree. First, it explains how the features of noise in web pages can be analyzed and extracted. Second, it explains how the parse tree of a web page can be built using the Document Object Model (DOM). Finally, it explains how domain-specific extraction rules and statistical methods can be applied to eliminate noise and extract the main text from web pages. A simulation is conducted, and the results show the applicability and feasibility of the proposed method.
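The abstract does not give the extraction rules or the statistics used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It builds a simple parse tree with Python's standard `html.parser`, drops subtrees by tag (an assumed rule set, `NOISE_TAGS`), and drops block elements whose text is mostly link text (an assumed statistic, with the hypothetical threshold `max_link_density`). The names `Node`, `TreeBuilder`, and `extract_main_text` are illustrative.

```python
from html.parser import HTMLParser

# Assumed rule set: subtrees under these tags are treated as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
# Assumed set of containers on which the link-density statistic is checked.
BLOCK_TAGS = {"div", "ul", "ol", "table", "td", "section", "p", "li"}
# Elements that never have a closing tag, so the builder must not descend.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of the parse tree; children are Nodes or text strings."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Builds a DOM-like tree from the parser's start/end/data events."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_length(node, inside_link=False):
    """Return (total characters, characters inside <a>) under node."""
    total = linked = 0
    inside_link = inside_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if inside_link else 0
        else:
            sub_total, sub_linked = text_length(child, inside_link)
            total += sub_total
            linked += sub_linked
    return total, linked


def extract_main_text(html, max_link_density=0.5):
    """Return the text left after rule-based and statistical filtering."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    kept = []

    def visit(node):
        if node.tag in NOISE_TAGS:
            return  # rule-based filter
        if node.tag in BLOCK_TAGS:
            total, linked = text_length(node)
            if total and linked / total > max_link_density:
                return  # statistical filter: block is mostly link text
        for child in node.children:
            if isinstance(child, str):
                kept.append(child)
            else:
                visit(child)

    visit(builder.root)
    return " ".join(kept)
```

Calling `extract_main_text(page_html)` on a page returns the surviving text as one space-joined string. The threshold and tag sets would need tuning per site, which is presumably where domain-specific rules come in.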

Kai Chen | Yan Zheng | Xiao-chun Cheng
