Neutaint: Efficient Dynamic Taint Analysis with Neural Networks

Dynamic taint analysis (DTA) is widely used by various applications to track information flow during runtime execution. Existing DTA techniques use rule-based taint-propagation, which is neither accurate (i.e., high false positive rate) nor efficient (i.e., large runtime overhead). It is hard to specify taint rules for each operation while covering all corner cases correctly. Moreover, the overtaint and undertaint errors can accumulate during the propagation of taint information across multiple operations. Finally, rule-based propagation requires each operation to be inspected before applying the appropriate rules resulting in prohibitive performance overhead on large real-world applications.In this work, we propose Neutaint, a novel end-to-end approach to track information flow using neural program embeddings. The neural program embeddings model the target’s programs computations taking place between taint sources and sinks, which automatically learns the information flow by observing a diverse set of execution traces. To perform lightweight and precise information flow analysis, we utilize saliency maps to reason about most influential sources for different sinks. Neutaint constructs two saliency maps, a popular machine learning approach to influence analysis, to summarize both coarse-grained and fine-grained information flow in the neural program embeddings.We compare Neutaint with 3 state-of-the-art dynamic taint analysis tools. The evaluation results show that Neutaint can achieve 68% accuracy, on average, which is 10% improvement while reducing 40× runtime overhead over the second-best taint tool Libdft on 6 real world programs. Neutaint also achieves 61% more edge coverage when used for taint-guided fuzzing indicating the effectiveness of the identified influential bytes. We also evaluate Neutaint’s ability to detect real world software attacks. The results show that Neutaint can successfully detect different types of vulnerabilities including buffer/heap/integer overflows, division by zero, etc. Lastly, Neutaint can detect 98.7% of total flows, the highest among all taint analysis tools.

[1]  Alessandro Orso,et al.  Dytan: a generic dynamic taint analysis framework , 2007, ISSTA '07.

[2]  David A. Wagner,et al.  This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. Detecting Format String Vulnerabilities with Type Qualifiers , 2001 .

[3]  Mark Raugas,et al.  Faster Fuzzing: Reinitialization with Deep Neural Models , 2017, ArXiv.

[4]  Herbert Bos,et al.  The BORG: Nanoprobing Binaries for Buffer Overreads , 2015, CODASPY.

[5]  Armando Solar-Lezama,et al.  sk_p: a neural program corrector for MOOCs , 2016, SPLASH.

[6]  Jing Chen,et al.  SmartSeed: Smart Seed Generation for Efficient Fuzzing , 2018, ArXiv.

[7]  Christopher Krügel,et al.  Cross Site Scripting Prevention with Dynamic Data Tainting and Static Analysis , 2007, NDSS.

[8]  Jun Wang,et al.  StraightTaint: Decoupled offline symbolic taint analysis , 2016, 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE).

[9]  Heng Yin,et al.  Panorama: capturing system-wide information flow for malware detection and analysis , 2007, CCS '07.

[10]  Rishabh Singh,et al.  Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks , 2016, ArXiv.

[11]  Alex Graves,et al.  Neural Turing Machines , 2014, ArXiv.

[12]  John C. S. Lui,et al.  TaintART: A Practical Multi-level Information-Flow Tracking System for Android RunTime , 2016, CCS.

[13]  Herbert Bos,et al.  Dowsing for Overflows: A Guided Fuzzer to Find Buffer Boundary Violations , 2013, USENIX Security Symposium.

[14]  Zhenkai Liang,et al.  One Engine To Serve 'em All: Inferring Taint Rules Without Architectural Semantics , 2019, NDSS.

[15]  Herbert Bos,et al.  VUzzer: Application-aware Evolutionary Fuzzing , 2017, NDSS.

[16]  James Newsom,et al.  Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software, Network and Distributed System Security Symposium Conference Proceedings : 2005 , 2005 .

[17]  Guofei Gu,et al.  TaintScope: A Checksum-Aware Directed Fuzzing Tool for Automatic Software Vulnerability Detection , 2010, 2010 IEEE Symposium on Security and Privacy.

[18]  Herbert Bos,et al.  Pointless tainting?: evaluating the practicality of pointer tainting , 2009, EuroSys '09.

[19]  Byung-Gon Chun,et al.  TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones , 2010, OSDI.

[20]  Ananthram Swami,et al.  The Limitations of Deep Learning in Adversarial Settings , 2015, 2016 IEEE European Symposium on Security and Privacy (EuroS&P).

[21]  Pushmeet Kohli,et al.  RobustFill: Neural Program Learning under Noisy I/O , 2017, ICML.

[22]  Hao Chen,et al.  Angora: Efficient Fuzzing by Principled Search , 2018, 2018 IEEE Symposium on Security and Privacy (SP).

[23]  Shweta Shinde,et al.  Neuro-Symbolic Execution: The Feasibility of an Inductive Approach to Symbolic Execution , 2018, ArXiv.

[24]  Martin C. Rinard,et al.  Taint-based directed whitebox fuzzing , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[25]  Lihong Li,et al.  Neuro-Symbolic Program Synthesis , 2016, ICLR.

[26]  Lujo Bauer,et al.  Riding out DOMsday: Towards Detecting and Preventing DOM Cross-Site Scripting , 2018, NDSS.

[27]  Leonidas J. Guibas,et al.  Learning Program Embeddings to Propagate Feedback on Student Code , 2015, ICML.

[28]  Manu Sridharan,et al.  TAJ: effective taint analysis of web applications , 2009, PLDI '09.

[29]  Nando de Freitas,et al.  Neural Programmer-Interpreters , 2015, ICLR.

[30]  Junfeng Yang,et al.  NEUZZ: Efficient Fuzzing with Neural Program Smoothing , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[31]  Anh Tuan Nguyen,et al.  Combining Deep Learning with Information Retrieval to Localize Buggy Files for Bug Reports (N) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[32]  Le Song,et al.  Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection , 2018 .

[33]  François Chollet,et al.  Keras: The Python Deep Learning library , 2018 .

[34]  Herbert Bos,et al.  Minemu: The World's Fastest Taint Tracker , 2011, RAID.

[35]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[36]  Stephen McCamant,et al.  DTA++: Dynamic Taint Analysis with Targeted Control-Flow Propagation , 2011, NDSS.

[37]  Jaegul Choo,et al.  End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks , 2017, IJCAI.

[38]  Ke Wang,et al.  Dynamic Neural Program Embedding for Program Repair , 2017, ICLR.

[39]  Rishabh Singh,et al.  Learn&Fuzz: Machine learning for input fuzzing , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[40]  Shouhuai Xu,et al.  VulDeePecker: A Deep Learning-Based System for Vulnerability Detection , 2018, NDSS.

[41]  Rishabh Singh,et al.  Not all bytes are equal: Neural byte sieve for fuzzing , 2017, ArXiv.

[42]  Wouter Joosen,et al.  Software vulnerability prediction using text analysis techniques , 2012, MetriSec '12.

[43]  Akbar Siami Namin,et al.  Predicting Vulnerable Software Components through N-Gram Analysis and Statistical Feature Selection , 2015, 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA).

[44]  Angelos D. Keromytis,et al.  libdft: practical dynamic data flow tracking for commodity systems , 2012, VEE '12.

[45]  Zhi-Hua Zhou,et al.  Learning Unified Features from Natural and Programming Languages for Locating Buggy Source Code , 2016, IJCAI.

[46]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[47]  Jianqi Shi,et al.  GANFuzz: a GAN-based industrial network protocol fuzzing framework , 2018, CF.

[48]  Ben Stock,et al.  25 million flows later: large-scale detection of DOM-based XSS , 2013, CCS.

[49]  Rahul Gupta,et al.  DeepFix: Fixing Common C Language Errors by Deep Learning , 2017, AAAI.

[50]  Ankur Taly,et al.  Axiomatic Attribution for Deep Networks , 2017, ICML.

[51]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.