Spam has seriously troubled the Internet community over the last few years and has now reached an alarming scale. Observations made at CERN (the European Organization for Nuclear Research, located in Geneva, Switzerland) show that spam can constitute up to 75% of daily SMTP traffic. A naive Bayesian classifier (NBC) based on a bag-of-words representation of an email is widely used to stop this unwanted flood, as it combines good performance with simple training and classification. However, because spam patterns change constantly, the classifier must be able to adapt online. This work proposes combining such a classifier with a second NBC based on pairs of adjacent words. Only the latter is retrained with examples of spam reported by users. Tests are performed on large sets of mail from both public spam archives and CERN mailboxes. They suggest that this architecture can increase spam recall without affecting classifier precision, which does happen when only the NBC based on single words is retrained. The algorithm's implementation and performance are also reevaluated from the perspective of more than a year.

Keywords: text classification, naive Bayesian classification, spam, email
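The architecture described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the Laplace smoothing, the rule of summing the two classifiers' log-odds, and all names (`NaiveBayes`, `CombinedFilter`, `report_spam`) are assumptions made for illustration only. What it does take from the abstract is the split of roles: one NBC over single words, one NBC over pairs of adjacent words, and user-reported spam retraining only the latter.

```python
import math
import re
from collections import Counter


def unigrams(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def bigrams(text):
    # Pairs of adjacent words, in order of appearance.
    words = unigrams(text)
    return list(zip(words, words[1:]))


class NaiveBayes:
    """Multinomial naive Bayes over an arbitrary feature extractor."""

    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(self.featurize(text))
        self.docs[label] += 1

    def log_odds(self, text):
        # log P(spam | text) - log P(ham | text), with add-one smoothing
        # (the smoothing scheme is an assumption of this sketch).
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        totals = {c: sum(self.counts[c].values()) for c in self.counts}
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for f in self.featurize(text):
            p_spam = (self.counts["spam"][f] + 1) / (totals["spam"] + vocab)
            p_ham = (self.counts["ham"][f] + 1) / (totals["ham"] + vocab)
            score += math.log(p_spam / p_ham)
        return score


class CombinedFilter:
    """Single-word NBC plus adjacent-word-pair NBC."""

    def __init__(self):
        self.word_nbc = NaiveBayes(unigrams)
        self.pair_nbc = NaiveBayes(bigrams)

    def train_initial(self, text, label):
        # Initial training feeds both classifiers.
        self.word_nbc.train(text, label)
        self.pair_nbc.train(text, label)

    def report_spam(self, text):
        # Online adaptation: only the word-pair NBC is retrained.
        self.pair_nbc.train(text, "spam")

    def score(self, text):
        # Assumed combination rule: sum of the two log-odds.
        return self.word_nbc.log_odds(text) + self.pair_nbc.log_odds(text)

    def is_spam(self, text, threshold=0.0):
        return self.score(text) > threshold


if __name__ == "__main__":
    f = CombinedFilter()
    f.train_initial("cheap pills buy now", "spam")
    f.train_initial("meeting agenda for monday", "ham")
    msg = "special offer on replica watches"
    print(f.is_spam(msg))
    f.report_spam(msg)
    print(f.is_spam(msg))
```

Keeping the single-word model frozen means a reported message can only shift the decision through the word pairs it contains, which is one plausible reading of why retraining the pair-based NBC alone would leave precision on ordinary mail largely unaffected.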