A fuzzy similarity approach for automated spam filtering

E-mail spam has become an epidemic problem that can negatively affect the usability of electronic mail as a means of communication. Besides wasting users' time and effort in scanning and deleting the massive amount of junk e-mail they receive, spam consumes network bandwidth and storage space, slows down e-mail servers, and provides a medium for distributing harmful or offensive content. Several machine learning approaches have been applied to this problem. In this paper, we explore a new approach based on fuzzy similarity that automatically classifies e-mail messages as spam or legitimate. We study its performance with various conjunction and disjunction operators on several datasets. The results are promising compared with a naive Bayesian classifier: classification accuracy above 97% and low false positive rates are achieved in many test cases.
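
To make the idea concrete, the following is a minimal sketch of how a fuzzy similarity score with interchangeable conjunction (t-norm) and disjunction (s-norm) operators might be computed. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the ratio form of the similarity, the representation of messages and categories as term-to-membership dictionaries, and all function names (`fuzzy_similarity`, `classify`, `t_min`, `s_max`, `t_product`, `s_prob_sum`) are hypothetical choices made for the example.

```python
from typing import Callable, Dict

# Assumption: a message or category is a fuzzy set mapping each term
# to a membership degree in [0, 1].
FuzzySet = Dict[str, float]
Operator = Callable[[float, float], float]


def t_min(a: float, b: float) -> float:
    """Minimum t-norm (one possible conjunction operator)."""
    return min(a, b)


def s_max(a: float, b: float) -> float:
    """Maximum s-norm (one possible disjunction operator)."""
    return max(a, b)


def t_product(a: float, b: float) -> float:
    """Algebraic product t-norm."""
    return a * b


def s_prob_sum(a: float, b: float) -> float:
    """Probabilistic sum s-norm, the dual of the algebraic product."""
    return a + b - a * b


def fuzzy_similarity(msg: FuzzySet, cat: FuzzySet,
                     t_norm: Operator = t_min,
                     s_norm: Operator = s_max) -> float:
    """Assumed ratio form: summed conjunction over summed disjunction."""
    terms = set(msg) | set(cat)
    num = sum(t_norm(msg.get(t, 0.0), cat.get(t, 0.0)) for t in terms)
    den = sum(s_norm(msg.get(t, 0.0), cat.get(t, 0.0)) for t in terms)
    return num / den if den > 0 else 0.0


def classify(msg: FuzzySet, spam: FuzzySet, legit: FuzzySet,
             t_norm: Operator = t_min, s_norm: Operator = s_max) -> str:
    """Assign the message to the category it is more similar to."""
    sim_spam = fuzzy_similarity(msg, spam, t_norm, s_norm)
    sim_legit = fuzzy_similarity(msg, legit, t_norm, s_norm)
    return "spam" if sim_spam > sim_legit else "legitimate"


# Toy membership degrees, invented purely for illustration.
message = {"free": 1.0, "offer": 1.0}
spam_terms = {"free": 0.9, "offer": 0.8, "meeting": 0.1}
legit_terms = {"free": 0.2, "offer": 0.1, "meeting": 0.9}
label = classify(message, spam_terms, legit_terms)
```

Passing `t_product` and `s_prob_sum` in place of the defaults swaps the operator pair without changing the rest of the sketch, which is the kind of variation the comparison across operators refers to.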
