Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models

Large language models (LLMs), such as Codex, hold great promise in enhancing programming education by automatically generating feedback for students. We investigate using LLMs to generate feedback for fixing syntax errors in Python programs, a key scenario in introductory programming. More concretely, given a student's buggy program, our goal is to generate feedback comprising a fixed program along with a natural language explanation describing the errors/fixes, inspired by how a human tutor would give feedback. While using LLMs is promising, the critical challenge is to ensure high precision in the generated feedback, which is imperative before deploying such technology in classrooms. The main research question we study is: Can we develop LLM-based feedback generation techniques with a tunable precision parameter, giving educators quality control over the feedback that students receive? To this end, we introduce PyFiXV, our technique for generating high-precision feedback, powered by Codex. The key idea behind PyFiXV is to use a novel run-time validation mechanism to decide whether the generated feedback is suitable for sharing with the student; notably, this validation mechanism also provides a precision knob to educators. We perform an extensive evaluation using two real-world datasets of Python programs with syntax errors and show the efficacy of PyFiXV in generating high-precision feedback.
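
The abstract describes a run-time validation mechanism with a tunable precision parameter but does not spell out its implementation. Below is a minimal sketch of one way such a validator could work, assuming the natural-language explanation is fed back to an LLM as a repair instruction and an acceptance threshold serves as the precision knob; the function names (`validate_feedback`, `parses`) and the `llm_repair` callable are illustrative placeholders, not part of PyFiXV itself.

```python
import ast
from typing import Callable


def parses(program: str) -> bool:
    """Return True if the Python program contains no syntax errors."""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False


def validate_feedback(
    buggy_program: str,
    fixed_program: str,
    explanation: str,
    llm_repair: Callable[[str, str], str],
    n_samples: int = 10,
    threshold: float = 0.5,
) -> bool:
    """Decide whether generated feedback is suitable for sharing.

    `llm_repair(buggy_program, explanation)` is assumed to prompt an LLM to
    repair the buggy program using the explanation as an instruction. The
    feedback is accepted only if the fraction of sampled repairs that are
    syntactically valid reaches `threshold`, which acts as the precision knob.
    """
    # The candidate fixed program itself must be syntactically valid.
    if not parses(fixed_program):
        return False

    # Sample several explanation-guided repairs and count the valid ones.
    successes = sum(
        parses(llm_repair(buggy_program, explanation)) for _ in range(n_samples)
    )
    return successes / n_samples >= threshold
```

Under this sketch, raising `threshold` (or `n_samples`) trades coverage for precision: fewer feedback instances pass validation and reach students, but those that do are more likely to be correct.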
