Classification of Web documents using a naive Bayes method

This paper presents WebDoc, an automatic document classification system that classifies Web documents according to the Library of Congress classification scheme. WebDoc constructs a knowledge base from training data and then classifies documents using the information in that knowledge base. One of the classification algorithms used in WebDoc is based on Bayes' theorem from probability theory. This paper focuses on three aspects of this approach: different event models for the naive Bayes method, different probability smoothing methods, and different feature selection methods. We report the performance of each method in terms of recall, precision, and F-measure. Experimental results show that the WebDoc system can classify Web documents effectively and efficiently.
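
To make the naive Bayes approach concrete, the following is a minimal sketch of one common combination: the multinomial event model with add-one (Laplace) smoothing. It is not WebDoc's implementation; the function names `train` and `classify`, the toy documents, and the choice of smoothing are assumptions made for illustration only. The labels "Q" (Science) and "M" (Music) are Library of Congress main classes.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration: (tokens, Library of Congress class).
TOY_DOCS = [
    (["quantum", "physics", "energy"], "Q"),
    (["physics", "particle", "energy"], "Q"),
    (["symphony", "orchestra", "violin"], "M"),
    (["violin", "concerto", "orchestra"], "M"),
]


def train(docs):
    """Count documents per class, term occurrences per class, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(tokens, class_docs, word_counts, vocab):
    """Return the class maximizing log P(c) + sum over tokens of log P(t | c)."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the product.
            p = (word_counts[label][t] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_DOCS)
print(classify(["energy", "physics"], *model))
print(classify(["violin"], *model))
```

Working in log space avoids numerical underflow when a document contains many terms. Other event models, smoothing methods, or a feature selection step would change the counting in `train` and the probability estimate in `classify` while leaving the decision rule the same.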
