Efficient email classification approach based on semantic methods

Abstract: Email has become one of the major applications in daily life. The continuous growth in the number of email users has led to a massive increase in unsolicited emails, also known as spam. Managing and classifying this huge volume of email is an important challenge. Most approaches proposed to solve this problem handle the high dimensionality of emails by using syntactic feature selection. In this paper, an efficient email filtering approach based on semantic methods is presented. The proposed approach employs the WordNet ontology and applies different semantic-based methods and similarity measures to reduce the huge number of extracted textual features, which in turn reduces the space and time complexities. Moreover, to obtain a minimal optimal feature set, feature dimensionality reduction is integrated using techniques such as Principal Component Analysis (PCA) and Correlation Feature Selection (CFS). Experimental results on the standard benchmark Enron dataset show that the proposed semantic filtering approach, combined with feature selection, achieves high computational performance at high space and time reduction rates. A comparative study of several classification algorithms indicates that Logistic Regression achieves the highest accuracy compared to Naive Bayes, Support Vector Machine, J48, Random Forest, and radial basis function networks. With the CFS feature selection technique integrated, the average recorded accuracy across all the algorithms used is above 90%, with more than 90% feature reduction. In addition, the experiments show that the proposed approach achieves higher accuracy in less time than other related works.
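The idea of semantic feature reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `TOY_CONCEPTS` dictionary is a hypothetical hand-written stand-in for WordNet lookups, and the function names and sample emails are assumptions made purely for illustration. The sketch only shows how mapping synonymous terms onto a shared concept shrinks the feature vocabulary before any further feature selection or classification step.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: each known word maps to one
# concept label. A real system would query the WordNet ontology instead.
TOY_CONCEPTS = {
    "buy": "purchase",
    "purchase": "purchase",
    "cheap": "inexpensive",
    "inexpensive": "inexpensive",
    "pills": "medication",
    "meds": "medication",
}

# Illustrative sample messages (not taken from the Enron dataset).
EMAILS = [
    "buy cheap pills now",
    "purchase inexpensive meds today",
    "meeting moved to monday",
]


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def syntactic_features(text):
    """Bag-of-words counts over the raw tokens."""
    return Counter(tokenize(text))


def semantic_features(text, concepts=TOY_CONCEPTS):
    """Bag-of-concepts counts: synonyms collapse onto one shared feature."""
    return Counter(concepts.get(tok, tok) for tok in tokenize(text))


def reduction_rate(emails, concepts=TOY_CONCEPTS):
    """Fraction of vocabulary entries removed by the concept mapping."""
    before = set()
    after = set()
    for text in emails:
        before.update(syntactic_features(text))
        after.update(semantic_features(text, concepts))
    return 1 - len(after) / len(before)


if __name__ == "__main__":
    print(semantic_features(EMAILS[0]))
    print(f"vocabulary reduction: {reduction_rate(EMAILS):.0%}")
```

In this toy setting, the two spam-like messages end up sharing the same concept features even though they have no words in common, which is the effect a semantic reduction step aims for. The reduction rates reported in the abstract come from the paper's experiments, not from this sketch.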
