TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology

With the widespread use of Web applications, security issues in source code are increasing, and exposed vulnerabilities seriously endanger the interests of service providers and customers. Several models address this problem, but most rely on complex graphs generated from source code or on regex patterns derived from expert experience. This paper proposes TAP, a static analysis model based on the token mechanism and deep learning, to discover vulnerabilities in PHP: Hypertext Preprocessor (PHP) Web programs conveniently. Building on the token mechanism of the PHP language, we designed a custom tokenizer that unifies tokens, supports some PHP features, and optimizes parsing. The tokenizer also implements parameter iteration to achieve data flow analysis. We trained TAP's deep learning model, which combines a word2vec model with a Long Short-Term Memory (LSTM) network, on the Software Assurance Reference Dataset (SARD) and the SQLI-LABS dataset. In the experiment on the CWE-89 dataset, TAP achieves an Area Under the Curve (AUC) of 0.9941, better than the other models, as well as the highest accuracy, 0.9787. Furthermore, compared with RIPS, TAP performs much better in multiclass classification, with a Kappa of 0.8319 and a Hamming distance of 0.0840.
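For readers unfamiliar with the two multiclass metrics quoted above, the following is a minimal sketch of how Cohen's Kappa and a normalized Hamming distance can be computed. It assumes single-label predictions, in which case the Hamming distance reduces to the fraction of mismatched labels. The function names and the class labels (`safe`, `sqli`, `xss`) are hypothetical and chosen only for illustration; this is not the evaluation code used for TAP.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of samples where both labelings match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both labelings use one and the same class.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def hamming_distance(y_true, y_pred):
    """Fraction of positions where the predicted label differs from the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for eight code samples (illustrative only).
y_true = ["safe", "sqli", "xss", "sqli", "safe", "xss", "safe", "sqli"]
y_pred = ["safe", "sqli", "xss", "safe", "safe", "xss", "sqli", "sqli"]

print("kappa:", cohen_kappa(y_true, y_pred))
print("hamming:", hamming_distance(y_true, y_pred))
```

A Kappa close to 1 indicates agreement well beyond chance, while a Hamming distance close to 0 indicates few misclassified samples, so the two figures are read in opposite directions.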
