An Efficient Method of Web Page Noise Cleaning for Effective Web Mining

The World Wide Web is a huge network whose pages contain large amounts of information. Web research often requires the main content of a page (e.g., an article text) to be gathered, processed, and stored quickly and efficiently, and mining Web data has become a major means of locating useful information. Pages that carry useful information usually also carry large amounts of noise, such as navigation bars, links, advertisements, and copyright notices. Identifying and removing this noise can improve the performance of Web mining. This paper proposes a new method that removes noise content tags and extracts the information in the main content tags of web pages.

General Terms: Web Mining, Global Noises, Local Noises, DOM, Web Pages, WWW.
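To make the idea of tag-level noise removal concrete, the following is a minimal sketch and not the method proposed in the paper, whose details the abstract does not give. It assumes that noise sits inside a fixed set of HTML tags; the set `NOISE_TAGS`, the class `NoiseStripper`, and the function `extract_main_text` are hypothetical names chosen for illustration. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption for illustration: these tags are treated as noise containers.
# The paper's actual criteria for noise tags are not stated in the abstract.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class NoiseStripper(HTMLParser):
    """Collects text that is not nested inside any tag listed in NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # number of currently open noise tags
        self.chunks = []      # text fragments found outside noise tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of `html` with the content of noise tags dropped."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<div><p>Main article text.</p></div>"
    "<footer>Copyright notice</footer>"
    "</body></html>"
)
print(extract_main_text(page))  # prints: Main article text.
```

A fixed tag list is the simplest possible noise criterion. It misses noise placed in generic containers such as `div`, which is one reason DOM-based approaches typically examine the structure and content of each node as well as its tag name.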
