Trustworthiness of spam email addresses using machine learning

Cybercriminals increasingly use spam email to deliver scams, phishing, malware and other fraud to organisations and individuals. They craft sophisticated, contextualised emails that look trustworthy to users, and the sender address is an essential part of that impression. Although cybersecurity agencies and companies develop products and organise courses to help people recognise email patterns, spam attacks have not yet been fully prevented. This work presents a proof-of-concept methodology that gives the user more meaningful information about trustworthiness in order to detect these harmful emails. For the first time in the literature, we present an email address dataset manually labelled into two classes, low and high quality. Moreover, we extracted 18 handcrafted features based on social engineering techniques and natural language properties. We evaluated four popular machine learning classifiers and obtained the best performance with Naive Bayes: an accuracy of 88.17% and an F1-score of 0.808. Additionally, we applied the InterpretML framework to identify the most relevant properties, with the eventual goal of implementing an automatic system that reports the trustworthiness of email addresses.
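
As an illustration of the general approach (handcrafted features from an email address fed to a Naive Bayes classifier), the following minimal sketch may help. Everything in it is an assumption made for illustration: the four features are hypothetical stand-ins and not the 18 features used in this work, the addresses and labels are invented toy data, and a Gaussian variant of Naive Bayes is assumed because the abstract does not state which variant was evaluated. It uses only the Python standard library.

```python
import math
from collections import defaultdict


def extract_features(address):
    """Map an address to a small numeric vector (hypothetical features)."""
    local, _, _domain = address.lower().partition("@")
    n = max(len(local), 1)  # avoid division by zero for an empty local part
    return [
        float(len(local)),                        # length of the local part
        sum(c.isdigit() for c in local) / n,      # share of digits
        sum(c in "aeiou" for c in local) / n,     # share of vowels
        float(sum(c in "._-+" for c in local)),   # number of separators
    ]


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: one mean and variance per class and feature."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.stats = {}
        for label, rows in rows_by_class.items():
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            # Small constant added so that constant features keep a non-zero variance.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-3
                for col, m in zip(columns, means)
            ]
            log_prior = math.log(len(rows) / len(X))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict_one(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )
        return max(self.stats, key=log_posterior)


# Invented toy data; labels mimic the "high" / "low" quality classes.
train = [
    ("john.smith@example.org", "high"),
    ("maria.lopez@example.org", "high"),
    ("anna_keller@example.com", "high"),
    ("x9k2q7z1@example.net", "low"),
    ("win88888prize@example.net", "low"),
    ("qzx7731bb@example.com", "low"),
]
X = [extract_features(addr) for addr, _ in train]
y = [label for _, label in train]
model = TinyGaussianNB().fit(X, y)

for candidate in ("peter.jones@example.org", "k3j9x0p2@example.net"):
    print(candidate, model.predict_one(extract_features(candidate)))
```

With only six training addresses the sketch says nothing about real-world accuracy; it only shows how address-level features and a probabilistic classifier fit together.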
