Email security level classification of imbalanced data using artificial neural network: The real case in a world-leading enterprise

Abstract: Email is far more convenient than traditional mail for delivering messages, but in business settings it is susceptible to information leakage. This problem can be alleviated by classifying emails into security levels using text mining and machine learning. In this research, we developed a scheme in which a neural network extracts information from emails and transforms each email into a multidimensional vector. Email text is processed into bi-grams to train the document vectors, which are then under-sampled to deal with class imbalance. Finally, an artificial neural network classifies the security level of each email. The proposed system was evaluated in an actual corporate setting. The results show that the proposed feature extraction approach represents email data more effectively than existing methods, as measured by true positive rate and F1-score.
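The two preprocessing steps named in the abstract, bi-gram extraction and under-sampling, can be illustrated with a minimal sketch. The abstract does not specify the tokenizer, the bi-gram variant, or the under-sampling method, so this example assumes whitespace tokenization, word-level bi-grams, and plain random under-sampling of the larger classes; the function names and the `normal`/`secret` labels are hypothetical. Training the document vectors and the neural network classifier is left out.

```python
import random
from collections import defaultdict


def word_bigrams(text):
    """Return word-level bi-grams of a text (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many items as the smallest class.

    This is simple random under-sampling, used here only as a stand-in;
    the method in the paper may differ.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    n_min = min(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        for sample in rng.sample(items, n_min):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Hypothetical toy data: five ordinary emails, two sensitive ones.
    emails = [
        "lunch at noon today",
        "meeting moved to friday",
        "see attached agenda",
        "thanks for the update",
        "happy birthday to you",
        "do not share the merger terms",
        "internal pricing sheet attached",
    ]
    labels = ["normal"] * 5 + ["secret"] * 2

    features = [word_bigrams(e) for e in emails]
    for bigrams, label in random_undersample(features, labels):
        print(label, bigrams)
```

After under-sampling, each class contributes the same number of emails, so a classifier trained on the result is not dominated by the majority class, at the cost of discarding some majority-class data.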
