PreNNsem: A Heterogeneous Ensemble Learning Framework for Vulnerability Detection in Software

Automated vulnerability detection is a critical problem in software security. Existing solutions mostly rely on features defined by human experts, which can cause potential vulnerabilities to be missed. Deep learning is an effective way to automate the extraction of vulnerability characteristics. This paper proposes an intelligent, automated vulnerability detection method that combines deep representation learning with heterogeneous ensemble learning. First, we transform the source code into sample data by removing segments that are unrelated to the vulnerability, which reduces the amount of code to analyze and improves detection efficiency in our experiments. Second, we represent the sample data as real-valued vectors by pre-training on the corpus so that its semantic information is preserved. Third, the vectors are fed to a deep learning model to obtain vulnerability features. Lastly, we train a heterogeneous ensemble classifier. To evaluate the detection method, we separately analyze the effectiveness and resource consumption of different network models, pre-training methods, classifiers, and vulnerabilities. We also compare our approach with well-known commercial vulnerability detection tools and academic methods. The experimental results show that the proposed method improves false positive rate, false negative rate, precision, recall, and F1 score.
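To make the final step and the reported metrics concrete, the following is a minimal sketch rather than the implementation used in the paper. It assumes binary labels (1 = vulnerable, 0 = not vulnerable) and assumes that the heterogeneous ensemble combines its base classifiers by simple majority vote; the abstract does not state the combination rule. The function names and the toy predictions are hypothetical and are not experimental data.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    Assumption: simple majority voting. With an odd number of models and
    binary labels there are no ties.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def detection_metrics(y_true, y_pred):
    """Compute FPR, FNR, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "fpr": ratio(fp, fp + tn),  # false positive rate
        "fnr": ratio(fn, fn + tp),  # false negative rate
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels and predictions from three hypothetical base classifiers.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
model_a = [1, 1, 0, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 0, 0, 1]
model_c = [1, 1, 0, 0, 1, 0, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
metrics = detection_metrics(y_true, ensemble)
print(ensemble)
print(metrics)
```

In this toy run the ensemble corrects the isolated false positives of individual models, which illustrates why combining different kinds of classifiers can lower the false positive rate without retraining any single model.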
