Random Forest Technique for E-mail Classification

Email has become an efficient and popular communication mechanism as the number of Internet users has increased. Because email is prone to misuse, its management is an important and growing problem for individuals and organizations. The blind posting of unsolicited email messages, known as spam, is one example of such misuse. Spam is commonly defined as the sending of unsolicited bulk email, that is, email sent to multiple recipients who did not ask for it. Classification algorithms such as Neural Networks (NN), Support Vector Machines (SVM), and Naive Bayes (NB) are currently applied to various datasets and show good classification results. This paper describes the classification of emails by the Random Forests (RF) technique. RF is an ensemble learning technique. Ensemble learning is a data mining approach consisting of methods that generate many classifiers, such as decision trees, and aggregate their results by taking a (weighted) vote of their predictions. First, the body of the message is preprocessed and its tokens are extracted. Then, using a term selection method, the most discriminative terms are retained and the other terms are removed. Next, iterative patterns are extracted and a feature vector is built for each sample. Finally, Random Forest is applied as the classifier. An email assigned to category 0 is non-spam; an email assigned to category 1 is spam.
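The pipeline described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: all function names and the toy emails are hypothetical, the term selection score (difference in document frequency between the two classes) is an assumption because the selection method is not named here, the iterative pattern extraction step is replaced by simple binary term-presence features, and the trees vote with equal weights.

```python
import random
import re
from collections import Counter


def tokenize(body):
    """Preprocess a message body: lowercase it and extract alphabetic tokens."""
    return re.findall(r"[a-z]+", body.lower())


def select_terms(docs, labels, k):
    """Keep the k terms whose document frequency differs most between classes.

    Assumed scoring rule; the text does not name its term selection method.
    """
    spam_df, ham_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (spam_df if label == 1 else ham_df).update(set(tokens))
    n_spam = max(1, sum(labels))
    n_ham = max(1, len(labels) - sum(labels))
    score = {t: abs(spam_df[t] / n_spam - ham_df[t] / n_ham)
             for t in set(spam_df) | set(ham_df)}
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def vectorize(tokens, terms):
    """Binary feature vector: 1 if the selected term occurs in the message."""
    present = set(tokens)
    return [1 if t in present else 0 for t in terms]


def gini(y):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(y) / len(y)
    return 2 * p * (1 - p)


def majority(y):
    """Most common label; ties go to 0 (non-spam)."""
    return 1 if 2 * sum(y) > len(y) else 0


def build_tree(X, y, rng, n_try, depth=0, max_depth=5):
    """Grow one tree, testing a random subset of n_try features per node."""
    if len(set(y)) <= 1 or depth >= max_depth:
        return majority(y)
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        left = [i for i in range(len(y)) if X[i][f] == 0]
        right = [i for i in range(len(y)) if X[i][f] == 1]
        if not left or not right:
            continue  # this feature does not split the node
        impurity = (len(left) * gini([y[i] for i in left])
                    + len(right) * gini([y[i] for i in right])) / len(y)
        if best is None or impurity < best[0]:
            best = (impurity, f, left, right)
    if best is None:
        return majority(y)
    _, f, left, right = best
    return (f,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       rng, n_try, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       rng, n_try, depth + 1, max_depth))


def predict_tree(node, x):
    """Follow splits (feature, subtree_if_0, subtree_if_1) down to a leaf label."""
    while isinstance(node, tuple):
        f, if_absent, if_present = node
        node = if_present if x[f] else if_absent
    return node


def train_forest(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    n_try = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, n_try))
    return forest


def predict_forest(forest, x):
    """Equal-weight majority vote: 1 means spam, 0 means non-spam."""
    return majority([predict_tree(tree, x) for tree in forest])


def classify(body, forest, terms):
    return predict_forest(forest, vectorize(tokenize(body), terms))


# Hypothetical toy training set: (message body, category).
emails = [
    ("Free prize winner! Claim your free prize now", 1),
    ("You are a winner, claim the free prize today", 1),
    ("Free prize for every winner, click now", 1),
    ("Winner winner: free prize inside", 1),
    ("Project meeting moved, please read the report", 0),
    ("The report for the project meeting is attached", 0),
    ("Meeting notes and project report for review", 0),
    ("Please review the project report before the meeting", 0),
]
docs = [tokenize(body) for body, _ in emails]
labels = [label for _, label in emails]
terms = select_terms(docs, labels, k=6)
forest = train_forest([vectorize(d, terms) for d in docs], labels)
print(classify("Claim your free prize, lucky winner", forest, terms))
```

Each tree sees a different bootstrap sample and a different random subset of terms at every split, so the trees disagree on noisy inputs and the vote averages out their individual errors. A weighted vote, as mentioned in the description of ensemble learning, would replace the equal-weight count in `predict_forest` with a per-tree weight.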