Leveraging Documentation to Test Deep Learning Library Functions

Testing the API functions of widely used deep learning (DL) libraries is essential. Effective testing requires the DL-specific input constraints of these API functions. Such constraints enable the generation of valid inputs, i.e., inputs that satisfy these constraints, so that testing can reach deep into the core functionality of the API functions. Existing fuzzers have no knowledge of such constraints, and existing constraint-extraction techniques are ineffective at extracting DL-specific input constraints. To fill this gap, we design and implement D2C, a document-guided fuzzing technique for API functions of DL libraries. D2C uses sequential pattern mining to generate rules for extracting DL-specific constraints from API documents, and it uses these constraints to guide the fuzzer to generate valid inputs automatically. D2C also generates inputs that violate these constraints to test the input validity checking code. In addition, D2C uses the constraints to generate boundary inputs to detect more bugs. Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that D2C’s accuracy in extracting input constraints is 83.3–90.0%. D2C detects 121 bugs, while a baseline fuzzer without input constraints detects only 68 bugs. Most (89) of the 121 bugs were previously unknown, and 54 of them have been fixed or confirmed by developers after we reported them. In addition, D2C detects 38 inconsistencies within documents, 28 of which have been fixed or confirmed after we reported them.
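The three input-generation modes named in the abstract (conforming, violating, and boundary inputs) can be illustrated with a minimal sketch. This is not D2C's implementation: the `Constraint` class, its fields, and the helper functions are hypothetical names chosen for illustration, and the sketch assumes the simplest possible case of a single integer parameter with a documented type and an inclusive range.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint for one integer parameter (illustration only)."""
    dtype: type  # documented type of the parameter
    low: int     # inclusive lower bound
    high: int    # inclusive upper bound


def satisfies(c: Constraint, value) -> bool:
    """Return True if value has the documented type and lies within the range."""
    if isinstance(value, bool) or not isinstance(value, c.dtype):
        return False
    return c.low <= value <= c.high


def conforming_input(c: Constraint, rng: random.Random) -> int:
    """Generate a valid input, intended to exercise core functionality."""
    return rng.randint(c.low, c.high)


def violating_input(c: Constraint, rng: random.Random):
    """Generate an input that breaks the type or the range,
    intended to exercise the input validity checking code."""
    kind = rng.choice(["type", "below", "above"])
    if kind == "type":
        return str(rng.randint(c.low, c.high))  # wrong type, in-range value
    if kind == "below":
        return c.low - rng.randint(1, 10)       # right type, too small
    return c.high + rng.randint(1, 10)          # right type, too large


def boundary_inputs(c: Constraint) -> list:
    """Return the edge values of the valid range."""
    return [c.low, c.high]


if __name__ == "__main__":
    rng = random.Random(0)
    axis = Constraint(dtype=int, low=0, high=3)  # e.g. an assumed "axis" argument
    print("conforming:", conforming_input(axis, rng))
    print("violating: ", violating_input(axis, rng))
    print("boundary:  ", boundary_inputs(axis))
```

A real DL fuzzer would have to handle richer constraints, such as tensor dtype, number of dimensions, shape, and dependencies between parameters; the sketch only shows how one extracted constraint can drive all three kinds of input.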
