Labeling malicious communication samples based on semi-supervised deep neural network

The scarcity of labeled sample data in advanced security threat detection seriously restricts research progress. Learning sample labels from both labeled and unlabeled data has received considerable research attention, and various general-purpose labeling methods have been proposed. However, labeling malicious communication samples associated with advanced threats faces two practical challenges: the difficulty of extracting effective features in advance, and the complexity of the sample types encountered in practice. To address these problems, we propose a sample labeling method for malicious communication based on a semi-supervised deep neural network. The method continuously learns and optimizes the feature representation while labeling samples, and it can handle uncertain samples that fall outside the sample types of interest. Experimental results show that the proposed deep neural network automatically learns an effective feature representation, and that the learned features are close to, or even better than, features extracted using expert knowledge. Furthermore, the proposed method achieves a labeling accuracy of 97.64% to 98.50%, which is more accurate than the train-then-detect, kNN, and LPA methods under every labeled-sample proportion tested. Insufficient labeled samples are a problem in many network attack detection scenarios, and this work can serve as a reference for sample labeling tasks in similar real-world settings.
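To make the idea of labeling with an "uncertain" outcome concrete, the following is a minimal sketch of a kNN labeler, one of the baselines named above, extended with a reject option. It is not the proposed deep neural network method. The function name `knn_label`, the `UNCERTAIN` marker, the thresholds `min_agreement` and `max_distance`, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math
from collections import Counter

UNCERTAIN = "uncertain"  # hypothetical marker for samples outside the known types


def knn_label(labeled, unlabeled, k=3, min_agreement=0.6, max_distance=0.5):
    """Label each unlabeled feature vector by majority vote of its k nearest
    labeled neighbors; return UNCERTAIN when the vote is weak or the
    neighbors are far away (illustrative thresholds, not from the paper)."""
    results = []
    for x in unlabeled:
        # k nearest labeled samples by Euclidean distance
        neighbors = sorted((math.dist(x, f), y) for f, y in labeled)[:k]
        votes = Counter(y for _, y in neighbors)
        label, count = votes.most_common(1)[0]
        mean_dist = sum(d for d, _ in neighbors) / len(neighbors)
        if count / len(neighbors) < min_agreement or mean_dist > max_distance:
            results.append(UNCERTAIN)
        else:
            results.append(label)
    return results


# Toy data: two known sample types and one point far from both.
labeled = [
    ((0.0, 0.0), "benign"), ((0.1, 0.0), "benign"), ((0.0, 0.1), "benign"),
    ((1.0, 1.0), "malicious"), ((1.1, 1.0), "malicious"), ((1.0, 1.1), "malicious"),
]
unlabeled = [(0.05, 0.05), (1.05, 1.05), (5.0, 5.0)]
labels = knn_label(labeled, unlabeled)
print(labels)  # ['benign', 'malicious', 'uncertain']
```

The distance threshold is what keeps the third point from being forced into one of the known types. A method that learns its feature representation would apply a comparable rejection rule in the learned feature space rather than on raw features.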
