Survey on Web Page Noise Cleaning for Web Mining

Web page noise cleaning is a new research area concerned with removing noise patterns from web pages so that web mining is more effective. The World Wide Web contains a large number of web pages that are accessible to users. Compared with conventional data or text, web pages generally contain a large amount of noise, that is, information that is not part of the main content of the page, e.g., advertisement banners, navigation bars, and disclaimer/copyright notices. The main objective of this area is to remove such irrelevant information (i.e., web page noise or local noise), which can seriously harm web mining tasks such as clustering and classification. The purpose of this paper is to review and discuss the major research work that has been done in this area and to identify its challenges and open issues.

Keywords: WWW, Web Page Cleaning, Noise Block, DOM Tree, Web Mining, Web pages.
