Fine Grained Dataflow Tracking with Proximal Gradients

Dataflow tracking with Dynamic Taint Analysis (DTA) is an important method in systems security with many applications, including exploit analysis, guided fuzzing, and side-channel information leak detection. However, DTA is fundamentally limited by the boolean nature of taint labels, which provide no information about the significance of detected dataflows and lead to false positives and false negatives on complex real-world programs. We introduce proximal gradient analysis (PGA), a novel, theoretically grounded approach that can track more accurate and fine-grained dataflow information than dynamic taint analysis. We observe that the gradients of neural networks precisely track dataflow and have been widely used for dataflow-guided tasks such as generating adversarial inputs and interpreting network decisions. However, programs, unlike neural networks, contain many discontinuous operations for which gradients cannot be computed. Our key insight is that we can efficiently approximate gradients over discontinuous operations by computing proximal gradients, a mathematically rigorous generalization of gradients for discontinuous functions. Proximal gradients allow us to apply the chain rule of calculus to accurately compose and propagate gradients over a program with minimal error. We compare our prototype PGA implementation against two state-of-the-art DTA implementations, DataFlowSanitizer and libdft, on 7 real-world programs. Our results show that PGA can improve the F1 accuracy of dataflow tracking by up to 33% over taint tracking without introducing any significant overhead (<5% on average). We further demonstrate the effectiveness of PGA by discovering 23 previously unknown security vulnerabilities and 2 side-channel leaks, and by analyzing 9 existing CVEs in the tested programs.
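
To make the idea of a gradient over a discontinuous operation concrete, the following is a minimal sketch based on the textbook proximal operator and Moreau envelope; it is not necessarily the exact formulation PGA uses. The names `prox`, `proximal_gradient`, and `step`, the integer search neighborhood `radius`, and the smoothing parameter `lam` are all assumptions introduced for illustration. The sketch finds the point `y` near `x` that minimizes `f(y) + (y - x)^2 / (2 * lam)` and reports `(x - y) / lam`, which is the gradient of the Moreau envelope wherever that minimizer is unique.

```python
def prox(f, x, lam, radius):
    """Brute-force proximal operator of f at integer x.

    Searches the integers in [x - radius, x + radius] for the point y that
    minimizes f(y) + (y - x)^2 / (2 * lam). Ties go to the y closest to x.
    """
    candidates = range(x - radius, x + radius + 1)
    return min(
        candidates,
        key=lambda y: (f(y) + (y - x) ** 2 / (2 * lam), abs(y - x)),
    )


def proximal_gradient(f, x, lam=1.0, radius=8):
    """Gradient of the Moreau envelope of f at x: (x - prox(x)) / lam."""
    y = prox(f, x, lam, radius)
    return (x - y) / lam


def step(x):
    """Hypothetical discontinuous operation: jumps from 0 to 10 at x = 3."""
    return 0 if x < 3 else 10


if __name__ == "__main__":
    # The ordinary derivative of step() is zero wherever it is defined, so it
    # says nothing about how the input influences the output near the jump.
    for x in (0, 3, 10):
        print(x, proximal_gradient(step, x))
```

In this sketch the value at `x = 3` is nonzero because a nearby input changes the output, while the value at `x = 10` is zero because no input within the effective neighborhood does. A boolean taint label would mark the output as tainted in both cases.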
