An Explainable Method of Phishing Emails Generation and Its Application in Machine Learning

Because phishing emails contain private information, they cannot be released publicly, which makes it difficult for researchers to obtain large-scale samples of real phishing emails. At the same time, legitimate email datasets are often collected from a particular domain and differ substantially from the phishing samples, which can easily cause overfitting or spatial bias when building a model. This paper proposes a method for generating phishing emails based on data insertion. The method increases the number of phishing samples without changing their malicious attributes, addresses the problem of spatial bias during model training, and reduces, to a certain extent, the difference in statistical characteristics between benign and malicious samples. Based on the differences in email HTML content between the Phishing dataset and the Enron dataset, this paper implements six resource generators and a communication relationship selector, and controls the generation of new samples through control-quantity sequence pairs. It also proposes quantitative evaluation methods and indicators for a classifier's generalization ability, and verifies that the newly generated samples can be used to train a classifier with stronger generalization ability. The main contribution of this paper is a method for providing the model with higher-quality data.