The Comparison of Word Embedding Techniques in RNNs for Vulnerability Detection

Many studies have combined deep learning and natural language processing (NLP) techniques in security systems to perform tasks such as bug detection, vulnerability prediction, and classification. Most of these works rely on NLP embedding methods to generate input vectors for the deep learning models. However, many embedding methods exist for encoding software text files into vectors, and the space of possible neural network structures is vast and explored largely by heuristics. This makes it challenging for researchers to choose an appropriate combination of embedding technique and model structure when training vulnerability detection classifiers. To address this, we propose a system that investigates four popular word embedding techniques combined with four different recurrent neural networks (RNNs), including both bidirectional RNNs (BRNNs) and unidirectional RNNs. We trained and evaluated the models on two types of datasets of vulnerable functions written in C. Our results showed that the FastText embedding technique combined with BRNNs produced the best detection rate of all the combinations on the real-world dataset, but not on the artificially produced dataset. Further experiments on other datasets are necessary to confirm this result.
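As a rough illustration of the kind of pipeline described above, the following pure-Python sketch tokenizes a C function, maps each token to a vector, and encodes the sequence with either a unidirectional or a bidirectional recurrent pass. It is not the authors' implementation: the tokenizer, the randomly initialized embedding table (a stand-in for a trained embedding such as FastText), the plain tanh recurrence (a stand-in for whichever recurrent cells the study used), and all function names and dimensions are assumptions made for this example.

```python
import math
import random
import re


def tokenize(code):
    """Split C source into identifiers, numbers, and single-character symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def make_embeddings(vocab, dim, seed=0):
    """Hypothetical embedding table: random vectors standing in for trained ones."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for tok in vocab}


def make_rnn_params(input_dim, hidden_dim, seed):
    """Randomly initialized (untrained) weights for a simple tanh RNN cell."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_x": matrix(hidden_dim, input_dim),
        "W_h": matrix(hidden_dim, hidden_dim),
        "b": [0.0] * hidden_dim,
    }


def rnn_final_state(vectors, params):
    """Run h_t = tanh(W_x x_t + W_h h_{t-1} + b) over the sequence; return the last h."""
    h = [0.0] * len(params["b"])
    for x in vectors:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(wx_row, x))
                + sum(w * hi for w, hi in zip(wh_row, h))
                + b
            )
            for wx_row, wh_row, b in zip(params["W_x"], params["W_h"], params["b"])
        ]
    return h


def encode(vectors, fwd, bwd=None):
    """Unidirectional encoding, or bidirectional when backward weights are given."""
    state = rnn_final_state(vectors, fwd)
    if bwd is not None:
        # The backward pass reads the tokens in reverse order; its final state
        # is concatenated with the forward state.
        state = state + rnn_final_state(list(reversed(vectors)), bwd)
    return state


# Illustrative input: a small C function with an unchecked strcpy.
source = "void f(char *s) { char buf[8]; strcpy(buf, s); }"
tokens = tokenize(source)
table = make_embeddings(sorted(set(tokens)), dim=4)
vectors = [table[t] for t in tokens]

fwd = make_rnn_params(input_dim=4, hidden_dim=3, seed=1)
bwd = make_rnn_params(input_dim=4, hidden_dim=3, seed=2)

uni = encode(vectors, fwd)        # 3 features
bi = encode(vectors, fwd, bwd)    # 6 features: forward state + backward state
```

In a trained classifier the encoded vector would feed a final layer that outputs a vulnerable or non-vulnerable label. The sketch only shows why a bidirectional encoder sees context from both ends of the function, while a unidirectional one summarizes it in a single reading direction.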
