Spam Feature Selection Based on the Improved Mutual Information Algorithm

Content-based spam filtering generally relies on a feature selection algorithm to classify mail. Building on the mutual information feature selection algorithm, this paper proposes two improved methods: mutual information with frequency (MIf), which introduces a word frequency factor, and mutual information with average frequency (MIaf), which introduces an average word frequency factor. Simulation experiments are conducted on an English corpus (the lemm_stop version of PU1) and a Chinese corpus (the CCERT email data set): feature subsets are extracted with the improved algorithms, and the mails are then classified with the Naïve Bayes algorithm. The experimental results show that the improved mutual information algorithms select better feature subsets and improve mail classification performance.
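
To make the idea concrete, the following minimal sketch scores terms with the standard document-frequency form of mutual information, log(P(t, c) / (P(t) P(c))), and then applies a frequency weight. The abstract does not state the MIf or MIaf formulas, so the weight used here, log(1 + term count in the target class), is only an illustrative assumption and not the paper's definition. The function names, the smoothing constant, and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def mi_scores(docs, labels, target="spam", smoothing=1.0):
    """Pointwise mutual information between term presence and the target class.

    Uses document counts: log(P(t, c) / (P(t) * P(c))). The additive smoothing
    is an ad hoc choice (an assumption) that avoids log(0) for terms that
    never occur in the target class.
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == target)
    df = Counter()    # number of documents containing the term
    df_c = Counter()  # number of target-class documents containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == target:
                df_c[t] += 1
    return {
        t: math.log(((df_c[t] + smoothing) * n) / ((df[t] + smoothing) * n_c))
        for t in df
    }


def frequency_weighted_scores(docs, labels, target="spam"):
    """MI multiplied by a hypothetical frequency factor.

    The factor log(1 + tf) is NOT taken from the paper; it only illustrates
    how a word frequency term can separate features that plain MI ties.
    """
    mi = mi_scores(docs, labels, target)
    tf_c = Counter()  # total occurrences of each term in target-class documents
    for tokens, y in zip(docs, labels):
        if y == target:
            tf_c.update(tokens)
    return {t: score * math.log(1 + tf_c[t]) for t, score in mi.items()}


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["free", "money", "free"],
    ["meeting", "tomorrow"],
    ["free", "offer"],
    ["project", "meeting"],
]
labels = ["spam", "ham", "spam", "ham"]

mi = mi_scores(docs, labels)
weighted = frequency_weighted_scores(docs, labels)
top_features = select_top_k(weighted, 2)
```

On this toy corpus, plain MI assigns the same score to "free" and "money" even though "free" occurs three times as often in spam; the frequency weight breaks that tie in favour of the more frequent word. The selected terms would then serve as the feature subset passed to a Naïve Bayes classifier.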