Feature Selection Algorithms to Improve Documents' Classification Performance

This paper presents a study in which feature selection algorithms were evaluated with the aim of improving document classification performance. The study was carried out within the DEEPSIA project (IST project Nr. 1999-20 283), funded by the European Union. Better document recognition was needed to raise the overall performance of the Framework for Internet data collection based on intelligent agents that is used within the project. The Framework is briefly described and the learning techniques used are presented. The focus of the paper is on the feature selection algorithms, where the most relevant work was the use of Conditional Mutual Information, estimated using genetic algorithms, since the computational complexity of CKN ruled out an iterative approach. Methods, techniques and comparative results are presented in detail.
