Noise Reduction of Web Pages via Feature Analysis

Noise information seriously affects studies that use web pages as datasets. As a fundamental task in information retrieval, removing noise from web pages quickly and accurately has received wide attention. In this paper, a noise reduction algorithm that uses the DOM (Document Object Model) to preserve the original structure of web pages is proposed to address the low efficiency of traditional noise reduction algorithms. With this method, noise information can be located rapidly by combining several analyzed features, e.g. Link Density and Punctuation Density. The approach is evaluated on a group of web pages selected randomly from several well-known websites. Experiments show satisfactory results.
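
To make the two named features concrete, the following is a minimal sketch of how they might be computed for a single block of a page. The definitions are assumptions, since the abstract does not state them: Link Density is taken as the share of non-whitespace text characters that lie inside anchor elements, and Punctuation Density as the share of text characters that are punctuation marks. The function names, the punctuation set, and both thresholds are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumed punctuation set (ASCII plus common full-width marks); not from the paper.
PUNCTUATION = set(",.;:!?\"'()，。；：！？、")


class _TextStats(HTMLParser):
    """Counts text characters, anchor-text characters, and punctuation."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.total_chars = 0
        self.link_chars = 0
        self.punct_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        text = "".join(data.split())  # drop all whitespace before counting
        self.total_chars += len(text)
        if self.anchor_depth > 0:
            self.link_chars += len(text)
        self.punct_chars += sum(ch in PUNCTUATION for ch in text)


def block_features(html_fragment):
    """Return (link_density, punctuation_density) for one HTML fragment."""
    stats = _TextStats()
    stats.feed(html_fragment)
    stats.close()
    if stats.total_chars == 0:
        return 0.0, 0.0
    return (stats.link_chars / stats.total_chars,
            stats.punct_chars / stats.total_chars)


def looks_like_noise(html_fragment, max_link_density=0.5, min_punct_density=0.02):
    """Hypothetical rule: mostly link text and almost no punctuation."""
    link_density, punct_density = block_features(html_fragment)
    return link_density > max_link_density and punct_density < min_punct_density
```

Under these assumed definitions, a navigation bar made up only of links has a Link Density of 1.0 and a Punctuation Density of 0.0 and is flagged as noise, while a sentence of body text containing one short link is not. In the proposed approach, such features would be evaluated per DOM node rather than on raw fragments.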