Eliminating noisy information in Web pages for data mining

A commercial Web page typically contains many information blocks. Besides the main content blocks, it usually includes navigation panels, copyright and privacy notices, and advertisements. We call the blocks that are not main content the noisy blocks of the page. We show that the information in these noisy blocks can seriously harm Web data mining, so eliminating this noise is of great importance. In this work, we propose a noise elimination technique based on a tree structure, called the pattern tree, that captures the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, we build a pattern tree for the site, which we call the site pattern tree (SPT). We then introduce an information-based measure to determine which parts of the SPT represent noise and which parts represent the main contents of the site. To detect and eliminate noise in any Web page of the site, we map the page to the SPT. We evaluate the proposed technique on a data mining task, Web clustering.
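To make the idea concrete, the following is a minimal sketch of the pipeline described above, not the method as actually specified in the paper. It assumes pages are already parsed into nested `(tag, text, children)` tuples, aligns nodes by tag and child position, and uses a normalized entropy over the texts seen at each node as a stand-in for the information-based measure. The names `PatternNode`, `merge_page`, `content_diversity`, `main_content`, and the threshold of 0.3 are all hypothetical.

```python
import math
from collections import Counter


class PatternNode:
    """One node of a simplified site pattern tree (hypothetical structure)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = {}      # (tag, position) -> PatternNode
        self.texts = Counter()  # text seen at this node -> number of pages
        self.pages = 0          # number of sampled pages that reach this node


def merge_page(node, page):
    """Merge one sampled page, given as (tag, text, children), into the tree."""
    _tag, text, children = page
    node.pages += 1
    node.texts[text] += 1
    for pos, child in enumerate(children):
        key = (child[0], pos)
        if key not in node.children:
            node.children[key] = PatternNode(child[0])
        merge_page(node.children[key], child)


def content_diversity(node):
    """Normalized entropy of the texts at a node (assumed measure).

    Close to 0 when every sampled page shows the same text (likely noise),
    close to 1 when every page shows a different text (likely main content).
    Nodes seen on fewer than two pages are treated as fully diverse.
    """
    if node.pages < 2:
        return 1.0
    entropy = -sum(
        (count / node.pages) * math.log(count / node.pages)
        for count in node.texts.values()
    )
    return entropy / math.log(node.pages)


def main_content(node, page, threshold=0.3):
    """Map a page onto the tree and keep leaf texts that are not judged noisy."""
    _tag, text, children = page
    if not children:
        return [text] if content_diversity(node) > threshold else []
    kept = []
    for pos, child in enumerate(children):
        # Structure never seen during sampling maps to a fresh node and is kept.
        child_node = node.children.get((child[0], pos)) or PatternNode(child[0])
        kept.extend(main_content(child_node, child, threshold))
    return kept


def make_page(body_text):
    """Toy page: shared navigation and copyright blocks around one content block."""
    return ("body", "", [
        ("div", "Home | Products | Contact", []),
        ("div", body_text, []),
        ("div", "Copyright 2003", []),
    ])


# Build the tree from three sampled pages of the same (toy) site.
pages = [make_page(t) for t in ("Review of camera A", "Review of printer B", "Review of phone C")]
root = PatternNode("body")
for sampled in pages:
    merge_page(root, sampled)

# Map a page that was not in the sample onto the tree.
cleaned = main_content(root, make_page("Review of lens D"))
```

In this toy site the navigation and copyright blocks are identical on every sampled page, so their diversity is zero and they are dropped, while the middle block differs on every page and is kept. A real implementation would also have to compare presentation styles and tolerate structural variation between pages, which this sketch ignores.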
