Learning to Find Bugs and Code Quality Problems - What Worked and What not?

The recent growth of open-source repositories and deep learning models has brought big promises of a next generation of programming tools that can automate or significantly improve the software development process. Yet such tools are still rare, and the machine learning components in them are not always apparent to their users. The most useful current techniques in machine learning for code also do not come from organizations such as Microsoft, Google, DeepMind, Facebook, OpenAI, or NVIDIA, which have invested the most in deep learning techniques such as very large neural networks. This probably means either that many of these coding problems differ significantly from other hot topics in deep learning, such as image processing, or that it is much more difficult to collect datasets that would result in similarly successful tools. In this work, we study the results in the literature on the topic and discuss ways to address these shortcomings.
