The Influence of Test Suite Properties on Automated Grading of Programming Exercises

Automated grading allows for the scalable assessment of large programming courses, often using test cases to determine the correctness of students' programs. However, test suites can vary in multiple ways, such as quality, size, and coverage. In this paper, we investigate how much test suites with varying properties impact the grades they generate, and how these properties cause this impact. We conduct a study using artificial faulty programs that simulate students' programming mistakes, together with test suites generated from manually written tests. We find that these test suites generate greatly varying grades, with the standard deviation of grades for each fault typically representing ∼84% of the grades not apportioned to the fault. We show that different properties of test suites can influence the grades that they produce, with coverage typically having the greatest effect, and with mutation score and the potentially redundant repeated coverage of lines also having a significant impact. We offer suggestions based on our findings to assist tutors with building grading test suites that assess students' code in a fair and consistent manner. These suggestions include ensuring that test suites have 100% coverage, avoiding unnecessarily re-covering lines, and checking test suites using real or artificial faults.
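
To make the grading model concrete, the following is a minimal Java sketch of the common scheme in which a submission's grade is the fraction of grading tests it passes. The class and member names (TestSuiteGrader, TestResult, gradeSubmission) are illustrative assumptions, not taken from the paper; the point is only that two suites with different size, coverage, or redundancy can assign the same faulty program very different grades under such a scheme.

```java
import java.util.List;

public class TestSuiteGrader {

    /** Result of running one grading test against a student submission (hypothetical type). */
    public record TestResult(String testName, boolean passed) {}

    /**
     * Computes grade = passing tests / total tests, in [0, 1].
     * Under this proportional scheme, the grade a fault receives depends
     * directly on how many tests in the suite happen to exercise it.
     */
    public static double gradeSubmission(List<TestResult> results) {
        if (results.isEmpty()) {
            throw new IllegalArgumentException("Grading suite must contain at least one test");
        }
        long passed = results.stream().filter(TestResult::passed).count();
        return (double) passed / results.size();
    }
}
```

For example, a fault detected by one test out of twenty yields a grade of 0.95, while the same fault detected by three redundant tests in a ten-test suite yields 0.70, which is the kind of suite-dependent variation the study quantifies.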
