Paper information - Detecting and Removing Noisy Data on Web Document using Text Density Approach

The content of web documents is a useful resource for many applications. However, this content can be divided into relevant and irrelevant content with respect to the application involved. Irrelevant content, such as advertisement banners, copyright information, and navigation menus, is regarded as noisy data. Noisy data found within the content of a web document negatively affects the performance of most applications that deal with web page content. Detecting and removing noisy data is therefore an important pre-processing step in many applications, such as web page classification, web page clustering, and information retrieval. We developed a unified algorithm that automatically detects noisy data, removes it from the web page, and produces a clean web document that can be used effectively in later steps. The suggested approach was evaluated on a dataset composed of different classes. The results of the conducted experiments showed a significant improvement in detecting and removing noisy data.

Hassan F. Eldirdiery | A. H. Ahmed
