Classifying Spam Emails Using Text and Readability Features

Supervised machine learning methods for classifying spam emails are long-established. Most of these methods use either header-based or content-based features. Spammers, however, can easily bypass these methods, especially those that rely on header features. In this paper, we report a novel spam classification method that uses features based on the language and readability of email content, combined with previously used content-based task features. The features are extracted from four benchmark datasets, viz. CSDMC2010, Spam Assassin, Ling Spam, and Enron-Spam. We use five well-known algorithms to induce our spam classifiers: Random Forest (RF), BAGGING, ADABOOSTM1, Support Vector Machine (SVM), and Naïve Bayes (NB). We evaluate the performance of the classifiers and find that BAGGING performs best. Moreover, its performance surpasses that of a number of state-of-the-art methods proposed in previous studies. Although we applied the method only to English-language emails, the results indicate that it may also be an effective means of classifying spam emails in other languages.
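
To make the notion of a readability feature concrete, the following is a minimal sketch of how one such score could be computed from an email body. It is an illustration only: the abstract does not state which readability scores or tokenization rules the method uses. The sketch assumes the standard Flesch Reading Ease formula, a crude vowel-group syllable counter, and hypothetical function names (`count_syllables`, `readability_features`).

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_features(text: str) -> dict:
    """Return a few readability features for a piece of text.

    Sentence and word splitting here are simplistic regular expressions,
    chosen for illustration rather than taken from the paper.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        # No measurable text: return neutral zeros instead of dividing by zero.
        return {
            "words_per_sentence": 0.0,
            "syllables_per_word": 0.0,
            "flesch_reading_ease": 0.0,
        }
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch Reading Ease formula.
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return {
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        "flesch_reading_ease": flesch,
    }


if __name__ == "__main__":
    print(readability_features("The cat sat. The dog ran."))
```

Features of this kind could be appended to a content-based feature vector before a classifier is trained; a more careful implementation would likely use a dictionary-based syllable counter.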
