Using Data Mining for Static Code Analysis of C

Static analysis of source code is one way to find bugs and problems in large software projects. Many approaches to static analysis have been proposed. We proposed a novel way of performing static analysis. Instead of methods based on semantic/logic analysis we apply machine learning directly to the problem. This has many benefits. Learning by example means trivial programmer adaptability (a problem with many other approaches), learning systems also has the advantage to be able to generalise and find problematic source code constructs that are not exactly as the programmer initially thought, to name a few. Due to the general interest in code quality and the availability of large open source code bases as test and development data, we believe this problem should be of interest to the larger data mining community. In this work we extend our previous approach and investigate a new way of doing feature selection and test the suitability of many different learning algorithms. This on a selection of problems we adapted from large publicly available open source projects. Many algorithms were much more successful than our previous proof-of-concept, and deliver practical levels of performance. This is clearly an interesting and minable problem.

[1]  Hannes Tribus,et al.  Static Code Features for a Machine Learning based Inspection : An approach for C , 2010 .

[2]  Dawson R. Engler,et al.  Checking system rules using system-specific, programmer-written compiler extensions , 2000, OSDI.

[3]  Witold Pedrycz,et al.  A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[4]  Yuriy Brun,et al.  Finding latent code errors via machine learning over program executions , 2004, Proceedings. 26th International Conference on Software Engineering.

[5]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[6]  Sarah Smith Heckman,et al.  A Model Building Process for Identifying Actionable Static Analysis Alerts , 2009, 2009 International Conference on Software Testing Verification and Validation.

[7]  Jochen Kreimer,et al.  Adaptive Detection of Design Flaws , 2005, LDTA@ETAPS.

[8]  Jacob West,et al.  Secure Programming with Static Analysis , 2007 .

[9]  Shari Lawrence Pfleeger,et al.  Software Metrics : A Rigorous and Practical Approach , 1998 .

[10]  Walter F. Tichy,et al.  Proceedings 25th International Conference on Software Engineering , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[11]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[12]  Norman E. Fenton,et al.  A Critique of Software Defect Prediction Models , 1999, IEEE Trans. Software Eng..

[13]  Yue Jiang,et al.  Comparing design and code metrics for software quality prediction , 2008, PROMISE '08.

[14]  Venkata U. B. Challagulla,et al.  Empirical assessment of machine learning based software defect prediction techniques , 2005, 10th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems.

[15]  Burak Turhan,et al.  Mining Software Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering Workshop.

[16]  Dejan Baca Automated static code analysis : A tool for early vulnerability detection , 2009 .

[17]  Stefan Axelsson,et al.  Detecting Defects with an Interactive Code Review Tool Based on Visualisation and Machine Learning , 2009, SEKE.

[18]  Gunnar Rätsch,et al.  An Improvement of AdaBoost to Avoid Overfitting , 1998, ICONIP.

[19]  Qinbao Song,et al.  Software defect association mining and defect correction effort prediction , 2006, IEEE Transactions on Software Engineering.

[20]  Thomas Reidemeister,et al.  Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis , 2008, 2008 11th IEEE High Assurance Systems Engineering Symposium.