Filtering noise in Web pages based on parsing tree

Abstract: This paper proposes a novel method for filtering noise from web pages using a parse tree. First, it explains how the features of noise in web pages can be analyzed and extracted. Second, it explains how the parse tree of a web page can be built using the Document Object Model (DOM). Finally, it explains how domain-specific extraction rules and statistical methods can be applied to eliminate noise and extract the main text from web pages. A simulation is conducted, and the results show the applicability and feasibility of the proposed method.
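
The pipeline outlined in the abstract (build a parse tree, then apply rules and statistics to each block) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify its rules or statistics, so the sketch substitutes a simple text-length and link-density heuristic. The names `Node`, `TreeBuilder`, `measure`, and `extract_main_text`, the tag sets, and the thresholds `min_chars` and `max_link_density` are all assumptions made for illustration. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical tag sets, chosen for illustration only.
SKIP_TAGS = {"script", "style"}                    # never treated as content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class Node:
    """One element of the parse tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def measure(node, in_link=False):
    """Return (text, link_chars) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return "", 0
    in_link = in_link or node.tag == "a"
    parts, link_chars = [], 0
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
            link_chars += len(child) if in_link else 0
        else:
            text, links = measure(child, in_link)
            if text:
                parts.append(text)
            link_chars += links
    return " ".join(parts), link_chars


def has_block_descendant(node):
    return any(
        isinstance(c, Node) and (c.tag in BLOCK_TAGS or has_block_descendant(c))
        for c in node.children
    )


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Keep innermost blocks that are long enough and not dominated by links."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    kept = []

    def visit(node):
        if node.tag in SKIP_TAGS:
            return
        if node.tag in BLOCK_TAGS and not has_block_descendant(node):
            text, link_chars = measure(node)
            if text and len(text) >= min_chars \
                    and link_chars / len(text) <= max_link_density:
                kept.append(text)
            return
        for child in node.children:
            if isinstance(child, Node):
                visit(child)

    visit(builder.root)
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page returns the text of the blocks that pass both thresholds, one block per line. Short blocks and link-heavy blocks such as navigation bars and footers are discarded as noise. Text that sits outside any tag in `BLOCK_TAGS` is ignored in this sketch, and a real deployment would presumably tune the thresholds or replace them with domain-specific rules of the kind the abstract mentions.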