Web Page Classification: A Soft Computing Approach

The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth of the Net has produced a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming urgent. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. Each page is described by a doublet representation, which associates a weight with each of the most representative words of the web document so as to characterize the word's relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated by a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated with two significantly different classes of web pages.
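
As a rough illustration of the doublet idea, the following minimal sketch builds (word, weight) pairs from an HTML page. It is not the paper's method: the abstract does not state the weighting criteria, so the per-tag emphasis scores, the additive scoring, the normalization to [0, 1], and the names `TAG_EMPHASIS`, `WordWeigher`, and `doublets` are all assumptions made for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical emphasis scores per HTML tag (assumed, not taken from the paper).
TAG_EMPHASIS = {"title": 1.0, "h1": 0.8, "b": 0.6, "strong": 0.6, "em": 0.5}
DEFAULT_EMPHASIS = 0.2  # assumed score for text outside any emphasized tag


class WordWeigher(HTMLParser):
    """Accumulates a score per word, based on the tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # currently open tags
        self.scores = defaultdict(float)  # word -> accumulated emphasis

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to (and including) the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest emphasis among the enclosing tags.
        emphasis = max(
            (TAG_EMPHASIS.get(t, DEFAULT_EMPHASIS) for t in self.stack),
            default=DEFAULT_EMPHASIS,
        )
        for word in re.findall(r"[a-z]+", data.lower()):
            self.scores[word] += emphasis


def doublets(html, top_k=5):
    """Return up to top_k (word, weight) pairs, weights scaled to [0, 1]."""
    parser = WordWeigher()
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return []
    peak = max(parser.scores.values())
    ranked = sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score / peak) for word, score in ranked[:top_k]]
```

Under these assumed scores, a word that appears in the page title ends up with a higher weight than a word that appears only in body text, which is the kind of HTML-derived relevance signal the abstract alludes to. The fuzzy-rule classifier and the genetic search are not sketched here, since the abstract gives no detail on the rule encoding or the fitness function.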
