Analyzing the Impact of Corpus Preprocessing on AntiSpam Filtering Software
Because of the volume of spam e-mail and its evolving nature, many statistical techniques have been applied to the construction of antispam filtering software. Training and testing such filters requires a large e-mail corpus. In this paper we discuss several considerations that researchers must take into account when building and processing a corpus. After reviewing several text preprocessing methods used in spam filtering, we show the results obtained by different machine learning and lazy learning approaches when the preprocessing of the training corpus changes. The experimental results are informative and support the idea that instance-based reasoning systems can offer significant advantages in the spam filtering domain.