A Novel Strategy for a Vertical Web Page Classifier Based on Continuous Learning Naïve Bayes Algorithm

Abstract Web page classification is one of the most challenging research areas. The web holds a huge volume of unstructured, distributed documents spanning a wide variety of domains, so basing the classification task on a single foundation is extremely difficult. In addition, the web is full of noise, which harms classifier performance, especially when the noise appears in the training data. It is therefore generally more valuable to build domain-oriented classifiers (vertical classifiers) that classify pages related to a specific domain. This paper analyzes a new way of applying Bayes' theorem to build a Domain-Oriented Naive Bayes (DONB) classifier. A main contribution is a novel classification strategy that adds a continuous learning ability to Bayes' theorem, yielding a Continuous Learning Naive Bayes (CLNB) classifier. Because overfitting has a great impact on most web page classification techniques, continuous learning is proposed as a solution: it allows the classifier to adapt itself continuously to achieve better performance. Both classifiers were tested; experimental results show that CLNB demonstrates a significant performance improvement over DONB, with its accuracy reaching 94.1% after testing 1000 pages. Moreover, because the classifier keeps learning, further accuracy gains are expected in future tests.
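The abstract does not spell out how CLNB updates itself, so the following is only a minimal sketch of one plausible reading: a multinomial Naive Bayes classifier whose counts are updated after every page it classifies. The class name `IncrementalNaiveBayes`, the Laplace smoothing parameter `alpha`, the toy word lists, and the choice to feed the classifier's own predictions back as training data are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts can be updated one page at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed)
        self.class_docs = Counter()              # pages seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()                       # all words seen so far

    def learn(self, words, label):
        """Fold one labelled page into the running counts."""
        self.class_docs[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        """Return the class with the highest log posterior, or None if untrained."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for w in words:
                # Smoothed log likelihood of each word given the class.
                score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def classify_and_learn(self, words):
        """Classify a page, then treat the prediction as a new training example.

        This self-training step is an assumption about what "continuous
        learning" could mean; the paper may use a different update rule.
        """
        label = self.predict(words)
        if label is not None:
            self.learn(words, label)
        return label


# Toy usage with made-up pages reduced to word lists.
clf = IncrementalNaiveBayes()
clf.learn(["python", "compiler", "code"], "programming")
clf.learn(["recipe", "oven", "flour"], "cooking")
label = clf.classify_and_learn(["python", "code", "tutorial"])
print(label)  # programming
```

After the call to `classify_and_learn`, the word "tutorial" is part of the "programming" counts, so later pages containing it are pulled toward that class. A practical system would likely only learn from predictions made with high confidence, since feeding back wrong labels can reinforce errors.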
