Syntactic vs. Semantic Similarity of Artificial and Real Faults in Mutation Testing Studies

Fault seeding is typically used in controlled studies to evaluate and compare test techniques. Central to these studies is the hypothesis that artificially seeded faults have realistic properties and thus yield realistic experimental results. In an attempt to strengthen realism, a recent line of research uses advanced machine learning techniques, such as deep learning and Natural Language Processing (NLP), to seed faults that look (syntactically) like real ones, implying that fault realism is related to syntactic similarity. This raises the question of whether seeding syntactically similar faults indeed results in semantically similar faults and, more generally, whether syntactically dissimilar faults are (semantically) far from the real ones. We answer this question by employing 4 fault-seeding techniques (PiTest, a popular mutation testing tool; IBIR, a tool with manually crafted fault patterns; DeepMutation, a learning-based fault-seeding framework; and CodeBERT, a novel mutation testing tool that uses code embeddings) and demonstrate that syntactic similarity does not reflect semantic similarity. We also show that 60%, 47%, 43% and 7% of the real faults of Defects4J V2 are semantically resembled by CodeBERT, PiTest, IBIR and DeepMutation faults, respectively. We then perform an objective comparison between the techniques and find that CodeBERT and PiTest have similar fault detection capabilities that subsume those of IBIR and DeepMutation, and that IBIR is the most cost-effective technique. Moreover, the overall fault detection of PiTest, CodeBERT, IBIR and DeepMutation was, on average, 54%, 53%, 37% and 7%, respectively.
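The distinction between syntactic and semantic similarity can be illustrated with a minimal sketch. The abstract does not say which metrics the study uses, so the choices here are assumptions made purely for illustration: syntactic similarity is taken as the Jaccard coefficient over code tokens, and semantic similarity as the Ochiai coefficient over the sets of tests that each fault causes to fail. The code snippets, test names, and failing-test sets are hypothetical.

```python
import math
import re


def tokens(code):
    """Split a snippet into a set of identifiers, numbers, and single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", code))


def jaccard(a, b):
    """Syntactic proxy (assumed): overlap of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ochiai(failing_a, failing_b):
    """Semantic proxy (assumed): overlap of two failing-test sets."""
    if not failing_a or not failing_b:
        return 0.0
    shared = len(failing_a & failing_b)
    return shared / math.sqrt(len(failing_a) * len(failing_b))


# Hypothetical real fault and two hypothetical seeded faults.
real_fault = "if (index > size) return null;"
mutant_a = "if (index >= size) return null;"      # one extra token, different behaviour
mutant_b = "if (!(index <= size)) return null;"   # looks different, same behaviour

# Hypothetical failing-test sets observed for each fault.
real_failing = {"testUpperBound", "testOverflow"}
mutant_a_failing = {"testExactSize"}
mutant_b_failing = {"testUpperBound", "testOverflow"}

syntactic_a = jaccard(tokens(real_fault), tokens(mutant_a))
syntactic_b = jaccard(tokens(real_fault), tokens(mutant_b))
semantic_a = ochiai(real_failing, mutant_a_failing)
semantic_b = ochiai(real_failing, mutant_b_failing)

print(f"mutant_a: syntactic={syntactic_a:.2f} semantic={semantic_a:.2f}")
print(f"mutant_b: syntactic={syntactic_b:.2f} semantic={semantic_b:.2f}")
```

In this toy setup the mutant that is closer to the real fault in tokens shares no failing tests with it, while the one that looks less similar fails exactly the same tests. This is the kind of mismatch the abstract reports, shown here on invented data rather than on the study's results.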
