Overview of textual anti-spam filtering techniques

Electronic mail (e-mail) is an essential communication tool that spammers have heavily abused to disseminate unwanted messages and spread malicious content to Internet users. Current Internet technologies have further accelerated the distribution of spam, so effective controls need to be deployed to counter the ever-growing spam problem. Machine learning provides better protective mechanisms for controlling spam. This paper summarizes the most common techniques for anti-spam filtering based on analysis of e-mail content, and examines machine learning algorithms, such as Naive Bayesian classifiers, support vector machines, and neural networks, that have been adopted to detect and control spam. Each machine learning algorithm has its own strengths and limitations; appropriate preprocessing therefore needs to be carefully considered to increase the effectiveness of any given algorithm.

Key words: Anti-spam filters, text categorization, electronic mail (e-mail), machine learning.
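As an illustration of content-based filtering with one of the algorithms named above, the following is a minimal sketch of a Naive Bayesian filter preceded by a simple preprocessing step. It is not taken from the paper: the class name `NaiveBayesSpamFilter`, the `tokenize` helper, the regular-expression tokenizer, and the use of Laplace smoothing are all assumptions made for the example, and a practical filter would likely add steps such as stop-word removal and stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing (assumed for this sketch): lowercase, keep word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes classifier over e-mail text."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one labelled message; label must be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        """Return the label with the higher log-posterior score."""
        if min(self.doc_counts.values()) == 0:
            raise ValueError("train at least one message of each class first")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label in ("spam", "ham"):
            # Log prior: fraction of training messages with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Win free money now", "spam")
spam_filter.train("Free prize, claim now", "spam")
spam_filter.train("Meeting schedule for Monday", "ham")
spam_filter.train("Project report attached", "ham")
print(spam_filter.predict("free money prize"))
print(spam_filter.predict("project meeting monday"))
```

Working in log space keeps the product of many small word probabilities from underflowing, and the tokenizer is the place where the preprocessing choices mentioned in the abstract would be made.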
