Detecting Defects with an Interactive Code Review Tool Based on Visualisation and Machine Learning

Code review is often suggested as a means of improving code quality. Since humans are poor at repetitive tasks, some form of tool support is valuable. To that end, we developed a prototype tool to illustrate the novel idea of applying machine learning (based on Normalised Compression Distance) to the static analysis of source code. Since this tool learns by example, it is trivially programmer-adaptable. As machine learning algorithms are notoriously opaque, that is, difficult to understand operationally, we applied information visualisation to the results of the learner. To validate the approach, we applied the prototype to source code from the open-source project Samba and from an industrial telecom software system. Our results showed that the tool did indeed correctly find and classify problematic sections of code based on training examples.
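To make the central idea concrete, the following is a minimal sketch of classifying a code fragment by Normalised Compression Distance. It is not the prototype's implementation: the choice of zlib as the compressor, the 1-nearest-neighbour decision rule, and the names `compressed_size`, `ncd`, and `classify` are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    inputs; values near 1 suggest unrelated inputs.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(snippet: str, examples: list[tuple[str, str]]) -> str:
    """Return the label of the training example closest to `snippet`.

    `examples` is a list of (source_text, label) pairs supplied by the
    programmer. The 1-nearest-neighbour rule is an illustrative
    assumption, not necessarily what the prototype uses.
    """
    target = snippet.encode()
    nearest = min(examples, key=lambda ex: ncd(target, ex[0].encode()))
    return nearest[1]
```

Because the training set is just a list of labelled fragments, adapting such a classifier to a new code base amounts to adding or removing examples, which is the sense in which a learn-by-example tool is programmer-adaptable.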
