Efficient Vulnerability Detection Based on Abstract Syntax Tree and Deep Learning

Automatic vulnerability detection in program source code is an important research topic. With the development of artificial intelligence, deep learning has been applied to vulnerability detection. Existing methods treat the code as plain text and therefore do not make full use of its syntactic structure, which introduces considerable redundancy. Moreover, to avoid the computational overhead caused by this redundancy, existing methods often truncate variable-length data, which causes data loss. In this paper, we propose a data processing method based on the abstract syntax tree that extracts all syntax features and reduces data redundancy. In addition, we apply the pack-padded method to a Bi-GRU network so that it can be trained on variable-length data without truncating sequences or processing padding tokens. Unlike current methods, our framework does not rely on experts or predefined rules, which makes it suitable for processing large volumes of source code. To evaluate our framework, we collect a vulnerability dataset of more than 260,000 functions covering 118 CWE types, which is larger than the datasets used in existing research. Experiments show that our framework outperforms existing methods.
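The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the parser or the target language, so Python's standard `ast` module is used here as a stand-in, and the helper names `ast_node_sequence` and `pack_batch_sizes` are hypothetical. The first helper serializes a function into a pre-order sequence of AST node types, which discards identifiers and formatting. The second computes, for a batch of variable-length sequences, how many sequences are still active at each time step, which is the bookkeeping behind packing utilities such as PyTorch's `pack_padded_sequence`.

```python
from __future__ import annotations

import ast


def ast_node_sequence(source: str) -> list[str]:
    """Serialize source code into a pre-order list of AST node-type names.

    Assumption: Python's own parser stands in for whatever parser the
    paper applies to its target language.
    """
    tree = ast.parse(source)
    sequence: list[str] = []
    stack: list[ast.AST] = [tree]
    while stack:
        node = stack.pop()
        sequence.append(type(node).__name__)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(list(ast.iter_child_nodes(node))))
    return sequence


def pack_batch_sizes(lengths: list[int]) -> list[int]:
    """Return the number of sequences still active at each time step.

    A recurrent layer fed packed data only processes these active
    entries, so no sequence is truncated and no padding is computed.
    """
    if not lengths:
        return []
    ordered = sorted(lengths, reverse=True)
    return [sum(1 for n in ordered if n > t) for t in range(ordered[0])]


if __name__ == "__main__":
    first = ast_node_sequence("def f(a):\n    return a + 1\n")
    second = ast_node_sequence("def g(b):\n    return b + 2\n")
    # Different identifiers and constants, same syntactic structure.
    print(first == second)

    lengths = [3, 1, 2]
    sizes = pack_batch_sizes(lengths)
    # Packed steps versus steps needed when padding to the longest sequence.
    print(sum(sizes), max(lengths) * len(lengths))
```

In this toy batch the packed representation covers 6 time steps, whereas padding every sequence to the longest one would cover 9. The gap grows with the spread of function lengths, which is presumably why packing is attractive for a dataset of functions with widely varying sizes.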
