An Empirical Assessment of Machine Learning Approaches for Triaging Reports of a Java Static Analysis Tool

Although static analysis tools can detect critical bugs in software, developers consider their high false positive rates to be a key barrier to using them in practice. To improve the usability of these tools, researchers have recently begun to apply machine learning techniques to classify and filter false positive analysis reports. Although initial results have been promising, the long-term potential and best practices for this line of research are unclear due to the lack of detailed, large-scale empirical evaluation. To partially address this knowledge gap, we present a comparative empirical study of four machine learning techniques, namely hand-engineered features, bag of words, recurrent neural networks, and graph neural networks, for classifying false positives, using multiple ground-truth program sets. We also introduce and evaluate new data preparation routines for recurrent neural networks and node representations for graph neural networks, and show that these routines can have a substantial positive impact on classification accuracy. Overall, our results suggest that recurrent neural networks (which learn over a program's source code) outperform the other subject techniques, although interesting tradeoffs are present among all techniques. Our observations provide insight into the future research needed to speed the adoption of machine learning approaches in practice.
