A hybrid learning algorithm for text classification

ABSTRACT

Text classification is the process of classifying documents into predefined categories based on their content. Existing supervised learning algorithms for automatic text classification need a sufficient number of training documents to learn accurately. This paper presents a new algorithm for text classification that requires fewer documents for training. Instead of using individual words, word relations, i.e., association rules over these words, are used to derive the feature set from pre-classified text documents. The Naive Bayes classifier is then applied to the derived features, and finally a single concept from the Genetic Algorithm is added for the final classification. Experimental results show that a classifier built this way is more accurate than existing text classification systems.

1. INTRODUCTION

Text classification has become one of the most important techniques in text data mining. The task is to automatically classify documents into predefined classes based on their content. Many algorithms have been developed to deal with automatic text classification [3], and alongside them a number of newly established processes have been incorporated into the automation of text classification. For text classification, the concept of association rules is well known. Association rule mining [1] finds interesting association or correlation relationships among a large set of data items [4]. The discovery of these relationships among huge amounts of transaction records can help in many decision-making processes.

The Naive Bayes classifier, on the other hand, uses maximum a posteriori estimation to learn a classifier. It assumes that the occurrence of each word in a document is conditionally independent of all other words in that document given its class [3]. Although Naive Bayes works well in many studies [7], it requires a large number of training documents to learn accurately.

A genetic algorithm starts with an initial population consisting of randomly generated rules. Each rule can be represented by a string of bits. Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training examples.

This paper presents a new algorithm for text classification. Instead of using individual words, word relations, i.e., association rules, are used to derive the feature set from pre-classified text documents. The Naive Bayes classifier is then applied to the derived features, and finally a concept from the Genetic Algorithm is added for the final classification; illustrative sketches of these two steps are given below. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.
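
To make the feature-derivation and Naive Bayes steps concrete, the following is a minimal Python sketch, not the implementation evaluated in this paper: frequent word pairs with a simple support threshold stand in for the mined association rules, and a Laplace-smoothed Naive Bayes classifier makes the maximum a posteriori decision over those features. All function names, example documents, and thresholds are illustrative assumptions.

    # Minimal sketch (not the authors' implementation): Naive Bayes over
    # association-rule style features. Frequent word pairs stand in for the
    # mined word relations; documents, labels, and thresholds are hypothetical.
    from collections import Counter, defaultdict
    from itertools import combinations
    from math import log

    def pair_features(text):
        """Derive 'word relation' features: unordered word pairs in a document."""
        words = sorted(set(text.lower().split()))
        return {frozenset(p) for p in combinations(words, 2)}

    def train(docs, labels, min_support=2):
        # Keep only pairs occurring in at least `min_support` training documents,
        # mimicking the support threshold of association rule mining.
        pair_doc_count = Counter(p for d in docs for p in pair_features(d))
        frequent = {p for p, c in pair_doc_count.items() if c >= min_support}

        class_count = Counter(labels)
        feat_count = defaultdict(Counter)            # class -> feature -> count
        for d, y in zip(docs, labels):
            for p in pair_features(d) & frequent:
                feat_count[y][p] += 1
        return frequent, class_count, feat_count

    def classify(text, frequent, class_count, feat_count):
        """MAP decision: argmax_c [log P(c) + sum_f log P(f | c)], with Laplace
        smoothing, assuming features are independent given the class."""
        feats = pair_features(text) & frequent
        n_docs = sum(class_count.values())
        best, best_score = None, float("-inf")
        for c, n_c in class_count.items():
            score = log(n_c / n_docs)
            for f in feats:
                score += log((feat_count[c][f] + 1) / (n_c + 2))
            if score > best_score:
                best, best_score = c, score
        return best

    if __name__ == "__main__":
        docs = ["cheap loans apply now", "meeting agenda attached",
                "loans approved apply today"]
        labels = ["spam", "ham", "spam"]
        model = train(docs, labels)
        print(classify("apply for cheap loans", *model))    # -> "spam"

Using word pairs rather than single words is only one possible reading of "word relations"; the point of the sketch is that the derived relations, not the raw vocabulary, form the feature space over which the Naive Bayes probabilities are estimated.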
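
The genetic algorithm component can be pictured with the following toy sketch of the loop described above: bit-string rules, fitness measured as classification accuracy on training examples, and a next generation built from the fittest rules plus their offspring. The paper adds only a single, unspecified concept of the Genetic Algorithm to the classifier, so this sketch does not claim to reproduce that design; every helper and parameter here is a hypothetical illustration.

    # Toy illustration of the GA loop described in the introduction
    # (not the authors' method). Rules are bit strings, fitness is accuracy
    # on the training examples, and selection keeps the fittest rules.
    import random

    def fitness(rule_bits, examples):
        """Fraction of training examples the rule classifies correctly.
        An example is (feature_bits, label); the rule predicts 1 when all
        of its set bits are present in the example's feature bits."""
        correct = 0
        for feature_bits, label in examples:
            predicted = all(f for r, f in zip(rule_bits, feature_bits) if r)
            correct += (predicted == label)
        return correct / len(examples)

    def crossover(a, b):
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def mutate(rule, rate=0.05):
        return [bit ^ (random.random() < rate) for bit in rule]

    def evolve(examples, n_features, pop_size=20, generations=30):
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(population, key=lambda r: fitness(r, examples),
                            reverse=True)
            parents = ranked[: pop_size // 2]      # survival of the fittest
            offspring = [mutate(crossover(random.choice(parents),
                                          random.choice(parents)))
                         for _ in range(pop_size - len(parents))]
            population = parents + offspring
        return max(population, key=lambda r: fitness(r, examples))

In a setting like the one described here, the bit positions could correspond to the association-rule features derived earlier, so that selection pressure favours rules whose feature patterns separate the classes.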