UNIFUZZ: A Holistic and Pragmatic Metrics-Driven Platform for Evaluating Fuzzers

A flurry of fuzzing tools (fuzzers) have been proposed in the literature, aiming at detecting software vulnerabilities effectively and efficiently. To date, it is however still challenging to compare fuzzers due to the inconsistency of the benchmarks, performance metrics, and/or environments for evaluation, which buries the useful insights and thus impedes the discovery of promising fuzzing primitives. In this paper, we design and develop UNIFUZZ, an open-source and metrics-driven platform for assessing fuzzers in a comprehensive and quantitative manner. Specifically, UNIFUZZ to date has incorporated 35 usable fuzzers, a benchmark of 20 real-world programs, and six categories of performance metrics. We first systematically study the usability of existing fuzzers, find and fix a number of flaws, and integrate them into UNIFUZZ. Based on the study, we propose a collection of pragmatic performance metrics to evaluate fuzzers from six complementary perspectives. Using UNIFUZZ, we conduct in-depth evaluations of several prominent fuzzers including AFL [1], AFLFast [2], Angora [3], Honggfuzz [4], MOPT [5], QSYM [6], T-Fuzz [7] and VUzzer64 [8]. We find that none of them outperforms the others across all the target programs, and that using a single metric to assess the performance of a fuzzer may lead to unilateral conclusions, which demonstrates the significance of comprehensive metrics. Moreover, we identify and investigate previously overlooked factors that may significantly affect a fuzzer's performance, including instrumentation methods and crash analysis tools. Our empirical results show that they are critical to the evaluation of a fuzzer. We hope that our findings can shed light on reliable fuzzing evaluation, so that we can discover promising fuzzing primitives to effectively facilitate fuzzer designs in the future.

[1]  Salvatore J. Stolfo,et al.  NEZHA: Efficient Domain-Independent Differential Testing , 2017, 2017 IEEE Symposium on Security and Privacy (SP).

[2]  Christopher Krügel,et al.  Driller: Augmenting Fuzzing Through Selective Symbolic Execution , 2016, NDSS.

[3]  Anja Feldmann,et al.  Static Program Analysis as a Fuzzing Aid , 2017, RAID.

[4]  Yang Liu,et al.  Superion: Grammar-Aware Greybox Fuzzing , 2018, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[5]  Hao Chen,et al.  Angora: Efficient Fuzzing by Principled Search , 2018, 2018 IEEE Symposium on Security and Privacy (SP).

[6]  Angelos D. Keromytis,et al.  SlowFuzz: Automated Domain-Independent Detection of Algorithmic Complexity Vulnerabilities , 2017, CCS.

[7]  Andrew E. Santosa,et al.  Smart Greybox Fuzzing , 2018, IEEE Transactions on Software Engineering.

[8]  Xu Zhou,et al.  PTfuzz: Guided Fuzzing With Processor Trace Feedback , 2018, IEEE Access.

[9]  Meng Xu,et al.  QSYM : A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing , 2018, USENIX Security Symposium.

[10]  Andrew Ruef,et al.  Evaluating Fuzz Testing , 2018, CCS.

[11]  Tibor Gyimóthy,et al.  Grammarinator: a grammar-based open source fuzzer , 2018, A-TEST@ESEC/SIGSOFT FSE.

[12]  Alves-FossJim,et al.  The DARPA Cyber Grand Challenge , 2015, IEEE S&P 2015.

[13]  David A. Wagner,et al.  Dynamic Test Generation to Find Integer Bugs in x86 Binary Linux Programs , 2009, USENIX Security Symposium.

[14]  Abhik Roychoudhury,et al.  Directed Greybox Fuzzing , 2017, CCS.

[15]  Regina Nuzzo,et al.  Scientific method: Statistical errors , 2014, Nature.

[16]  Herbert Bos,et al.  VUzzer: Application-aware Evolutionary Fuzzing , 2017, NDSS.

[17]  Koushik Sen,et al.  FairFuzz: A Targeted Mutation Strategy for Increasing Greybox Fuzz Testing Coverage , 2018, 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[18]  Choongwoo Han,et al.  Grey-Box Concolic Testing on Binary Code , 2019, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[19]  Yong Tang,et al.  LearnAFL: Greybox Fuzzing With Knowledge Enhancement , 2019, IEEE Access.

[20]  Ahmad-Reza Sadeghi,et al.  NAUTILUS: Fishing for Deep Bugs with Grammars , 2019, NDSS.

[21]  Abhik Roychoudhury,et al.  Coverage-Based Greybox Fuzzing as Markov Chain , 2016, IEEE Transactions on Software Engineering.

[22]  Sang Kil Cha,et al.  CodeAlchemist: Semantics-Aware Code Generation to Find Vulnerabilities in JavaScript Engines , 2019, NDSS.

[23]  Dawei Qi,et al.  SemFix: Program repair via semantic analysis , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[24]  Junfeng Yang,et al.  NEUZZ: Efficient Fuzzing with Neural Program Learning , 2018, ArXiv.

[25]  Alexander Pretschner,et al.  Compositional Fuzzing Aided by Targeted Symbolic Execution , 2019, ArXiv.

[26]  Pablo Buiras,et al.  QuickFuzz: an automatic random fuzzer for common file formats , 2016, Haskell.

[27]  Hao Chen,et al.  Matryoshka: Fuzzing Deeply Nested Branches , 2019, CCS.

[28]  Shuvendu K. Lahiri,et al.  Automatic Rootcausing for Program Equivalence Failures in Binaries , 2015, CAV.

[29]  William K. Robertson,et al.  LAVA: Large-Scale Automated Vulnerability Addition , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[30]  Christopher Krügel,et al.  SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[31]  Jingling Xue,et al.  A Feature-Oriented Corpus for Understanding, Evaluating and Improving Fuzz Testing , 2019, AsiaCCS.

[32]  Mathias Payer,et al.  T-Fuzz: Fuzzing by Program Transformation , 2018, 2018 IEEE Symposium on Security and Privacy (SP).

[33]  Michael Norrish,et al.  MoonLight: Effective Fuzzing with Near-Optimal Corpus Distillation , 2019, ArXiv.

[34]  Chao Zhang,et al.  MOPT: Optimized Mutation Scheduling for Fuzzers , 2019, USENIX Security Symposium.

[35]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[36]  Chao Zhang,et al.  CollAFL: Path Sensitive Fuzzing , 2018, 2018 IEEE Symposium on Security and Privacy (SP).

[37]  A. Vargha,et al.  A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong , 2000 .

[38]  N. Lazar,et al.  Moving to a World Beyond “p < 0.05” , 2019, The American Statistician.

[39]  Yang Liu,et al.  Steelix: program-state based binary fuzzing , 2017, ESEC/SIGSOFT FSE.