A Fuzzy Clustering Approach to Filter Spam E-Mail

Spam email is the practice of sending unwanted email messages, usually with commercial content, in large quantities to an indiscriminate set of email accounts. Because spammers continuously improve their techniques to evade spam filters, building a filter that can learn incrementally and adapt has become an active research field. Researchers have employed machine learning techniques that are widely used for similar problems such as document classification and pattern recognition, including Naïve Bayesian classifiers and Support Vector Machines. In this paper, we examine the use of a fuzzy clustering algorithm, Fuzzy C-Means, to build a spam filter. The approach was tested on data sets of different sizes collected from the SpamAssassin corpora of real users' emails. Having tested Fuzzy C-Means with the Heterogeneous Value Difference Metric on variable percentages of spam, using a standard assessment model for the spam problem, we demonstrate the potential value of our approach.
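To make the clustering step concrete, the following is a minimal sketch of the standard Fuzzy C-Means update rules in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: it uses Euclidean distance in place of the Heterogeneous Value Difference Metric, a simple farthest-first choice of initial centers, and two hypothetical numeric features per message. The function names, the fuzzifier value m = 2 and the toy data are all assumptions made for this example.

```python
import math


def farthest_first(points, c):
    """Pick c initial centers: the first point, then repeatedly the point
    farthest from the centers chosen so far (an assumed heuristic)."""
    centers = [list(points[0])]
    while len(centers) < c:
        nxt = max(points, key=lambda p: min(math.dist(p, q) for q in centers))
        centers.append(list(nxt))
    return centers


def memberships(points, centers, m):
    """Membership of each point in each cluster; every row sums to 1."""
    u = []
    for p in points:
        d = [math.dist(p, q) for q in centers]
        if min(d) == 0.0:
            # The point sits exactly on a center: full membership there.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([h / sum(hits) for h in hits])
        else:
            # u_ij is proportional to d_ij^(-2/(m-1)), normalized per point.
            inv = [x ** (-2.0 / (m - 1.0)) for x in d]
            u.append([v / sum(inv) for v in inv])
    return u


def fuzzy_c_means(points, c=2, m=2.0, max_iter=100, tol=1e-9):
    """Alternate membership and center updates until centers stop moving."""
    dim = len(points[0])
    centers = farthest_first(points, c)
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(c):
            # Each center is the mean of the points weighted by u_ij^m.
            w = [row[j] ** m for row in u]
            new_centers.append(
                [sum(wi * p[k] for wi, p in zip(w, points)) / sum(w)
                 for k in range(dim)]
            )
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers, memberships(points, centers, m)


# Hypothetical two-feature messages (for example, scaled link count and
# scaled fraction of capital letters); the values are invented.
ham = [(0.10, 0.00), (0.20, 0.10), (0.15, 0.05)]
spam = [(0.90, 0.80), (0.80, 0.90), (0.85, 0.95)]

centers, u = fuzzy_c_means(ham + spam, c=2)
# Harden the fuzzy memberships into a single cluster index per message.
labels = [row.index(max(row)) for row in u]
```

Unlike hard clustering, each message keeps a graded membership in both clusters, so a filter built on this could apply a threshold to the spam-cluster membership rather than taking the largest value. Handling mixed nominal and numeric attributes, as the Heterogeneous Value Difference Metric does, would require replacing the `math.dist` calls with that metric.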
