LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation

Compiler bugs pose a significant threat to safety-critical applications, and isolating these bugs promptly and effectively is crucial for assuring compiler quality. However, the limited debugging information available for reported bugs complicates compiler bug isolation. Existing approaches typically convert the problem into a test program mutation problem, but they remain limited by ineffective mutation strategies or high human effort. Drawing inspiration from the recent progress of pre-trained Large Language Models (LLMs), such as ChatGPT, in code generation, we propose LLM4CBI, a new approach that tames LLMs to generate effective test programs for compiler bug isolation. Using LLMs directly for test program mutation, however, may not yield the desired results because formulating precise prompts and selecting specialized prompts are challenging. To overcome these challenges, LLM4CBI introduces three new components. (1) A program complexity-guided prompt production component leverages data- and control-flow analysis to identify the most valuable variables and locations in programs for mutation. (2) A memorized prompt selection component adopts reinforcement learning to continuously select specialized prompts for mutating test programs. (3) A test program validation component selects specialized feedback prompts to avoid repeating the same mistakes during the mutation process. Compared with the state-of-the-art approaches DiWi and RecBi, our evaluation demonstrates the advantages of LLM4CBI: it isolates 13.6% to 90.9% more bugs than the other approaches across various settings. Additionally, we demonstrate that LLM4CBI is extensible, allowing easy integration with other LLMs.
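The memorized prompt selection component described above can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's actual algorithm: the prompt names, the epsilon-greedy bandit policy, and the scalar reward scheme are all hypothetical stand-ins for LLM4CBI's reinforcement-learning-based selection with memorized feedback.

```python
import random

class PromptSelector:
    """Epsilon-greedy selection over a fixed pool of mutation prompts.

    Illustrative only: prompt names and the reward model are
    hypothetical; LLM4CBI's real component is more elaborate.
    """

    def __init__(self, prompts, epsilon=0.1, seed=0):
        self.prompts = list(prompts)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {p: 0 for p in self.prompts}    # times each prompt was tried
        self.values = {p: 0.0 for p in self.prompts}  # running mean reward (the "memory")

    def select(self):
        # Explore a random prompt with probability epsilon,
        # otherwise exploit the prompt with the best memorized reward.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.prompts)
        return max(self.prompts, key=lambda p: self.values[p])

    def update(self, prompt, reward):
        # Incremental mean update: memorize how effective each
        # prompt was at producing useful mutated test programs.
        self.counts[prompt] += 1
        n = self.counts[prompt]
        self.values[prompt] += (reward - self.values[prompt]) / n
```

In use, each round would pick a prompt via `select()`, ask the LLM to mutate the test program with it, score the result (e.g., whether the mutant still triggers the bug), and feed that score back via `update()`, so effective prompts are chosen increasingly often.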
