Learning to find naming issues with big code and small supervision

We introduce a new approach for finding and fixing naming issues in source code. The method is based on a careful combination of unsupervised and supervised procedures: (i) unsupervised mining of patterns from Big Code that express common naming idioms, where program fragments violating such idioms indicate likely naming issues, and (ii) supervised learning of a classifier on a small labeled dataset that filters potential false positives from the violations. We implemented our method in a system called Namer and evaluated it on a large number of Python and Java programs. We demonstrate that Namer is effective in finding naming mistakes in real-world repositories with high precision (~70%). Perhaps surprisingly, we also show that existing deep learning methods are not practically effective and achieve low precision in finding naming issues (up to ~16%).
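The two-stage idea can be illustrated with a toy sketch. This is not Namer's implementation: the `(context, name)` representation of a program fragment, the function names, the support and confidence cutoffs, and the single-feature threshold used in place of a real classifier are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def mine_idioms(corpus, min_support=3, min_confidence=0.7):
    """Stage (i), unsupervised: for each context, keep the dominant name
    if it is frequent and consistent enough to count as an idiom.
    `corpus` is a list of hypothetical (context, name) pairs."""
    by_context = defaultdict(Counter)
    for context, name in corpus:
        by_context[context][name] += 1
    idioms = {}
    for context, names in by_context.items():
        name, count = names.most_common(1)[0]
        confidence = count / sum(names.values())
        if count >= min_support and confidence >= min_confidence:
            idioms[context] = (name, confidence)
    return idioms


def find_violations(fragments, idioms):
    """Flag fragments whose name deviates from the mined idiom."""
    violations = []
    for context, name in fragments:
        if context in idioms and idioms[context][0] != name:
            suggested, confidence = idioms[context]
            violations.append({"context": context, "found": name,
                               "suggested": suggested,
                               "confidence": confidence})
    return violations


def fit_threshold(labeled):
    """Stage (ii), supervised: a one-feature stand-in for a classifier.
    `labeled` is a small list of (confidence, is_real_issue) pairs; pick
    the confidence cutoff that classifies the most examples correctly."""
    best_t, best_correct = 0.0, -1
    for t in sorted({c for c, _ in labeled}):
        correct = sum((c >= t) == y for c, y in labeled)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def report(violations, threshold):
    """Keep only violations the learned filter accepts."""
    return [v for v in violations if v["confidence"] >= threshold]
```

In this sketch the mining step needs no labels at all, while the filter is fitted on only a handful of labeled violations, which mirrors the "big code, small supervision" split described in the abstract.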
