Spam Feature Selection Based on the Improved Mutual Information Algorithm
Content-based spam filtering technologies generally use a feature selection algorithm for mail classification. Building on the mutual information feature selection algorithm, this paper proposes two improved methods: mutual information with frequency (MIf), which introduces a word frequency factor, and mutual information with average frequency (MIaf), which introduces an average word frequency factor. Simulation experiments are conducted on an English corpus (the lemm_stop version of PU1) and a Chinese corpus (the CCERT email data set). Feature subsets are extracted with the improved algorithms, and the mails are then classified with the Naïve Bayes algorithm. The experimental results show that the improved mutual information algorithms select better feature subsets and improve mail classification performance.
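To make the idea concrete, the following is a minimal sketch of mutual-information feature scoring with an optional word frequency factor. The abstract does not state the MIf or MIaf formulas, so the weighting used here (the term's share of all word occurrences in the spam class) is an assumption chosen for illustration, not the paper's definition. The names `mi_scores`, `top_k`, and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def mi_scores(docs, labels, target="spam", use_frequency=False):
    """Score terms by pointwise mutual information with the target class.

    docs is a list of token lists and labels the matching class labels.
    MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from document counts.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # number of documents containing the term
    df_target = Counter()  # number of target-class documents containing it
    tf_target = Counter()  # total occurrences of the term in the target class
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df_all[t] += 1
            if y == target:
                df_target[t] += 1
        if y == target:
            tf_target.update(tokens)
    total_tf = sum(tf_target.values()) or 1

    scores = {}
    for t, df in df_all.items():
        a = df_target[t]
        if a == 0:
            continue  # term never occurs in the target class; MI is undefined
        score = math.log((a * n_docs) / (df * n_target))
        if use_frequency:
            # ASSUMPTION: illustrative frequency factor, not the paper's formula.
            score *= tf_target[t] / total_tf
        scores[t] = score
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Toy corpus (hypothetical): three spam mails and three legitimate mails.
docs = [
    ["free", "free", "free", "offer"],
    ["free", "free", "winner"],
    ["offer", "click"],
    ["free", "meeting"],
    ["meeting", "report"],
    ["report", "offer"],
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

plain = mi_scores(docs, labels)
weighted = mi_scores(docs, labels, use_frequency=True)
print(top_k(plain, 2))
print(top_k(weighted, 2))
```

On this toy corpus, plain mutual information ranks the one-off words "click" and "winner" highest, which reflects its known bias toward rare terms. With the frequency factor applied, "free" moves to the top. The selected terms would then serve as the feature subset passed to a Naïve Bayes classifier.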