Paper information - Learning to find naming issues with big code and small supervision

Learning to find naming issues with big code and small supervision

We introduce a new approach for finding and fixing naming issues in source code. The method is based on a careful combination of unsupervised and supervised procedures: (i) unsupervised mining of patterns from Big Code that express common naming idioms, where program fragments violating such idioms indicate likely naming issues, and (ii) supervised learning of a classifier on a small labeled dataset, which filters potential false positives from the violations. We implemented our method in a system called Namer and evaluated it on a large number of Python and Java programs. We demonstrate that Namer is effective in finding naming mistakes in real-world repositories with high precision (~70%). Perhaps surprisingly, we also show that existing deep learning methods are not practically effective and achieve low precision in finding naming issues (up to ~16%).
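The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the Namer implementation: the `(context, name)` representation, the support and confidence thresholds, the single-feature threshold "classifier", and all data below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def mine_idioms(corpus, min_support=3, min_confidence=0.7):
    """Stage (i), toy version: for each context, keep the dominant name
    if it is frequent enough to count as a naming idiom."""
    by_context = defaultdict(Counter)
    for context, name in corpus:
        by_context[context][name] += 1
    idioms = {}
    for context, names in by_context.items():
        name, count = names.most_common(1)[0]
        confidence = count / sum(names.values())
        if count >= min_support and confidence >= min_confidence:
            idioms[context] = (name, confidence)
    return idioms


def find_violations(corpus, idioms):
    """Report each occurrence whose name deviates from a mined idiom."""
    violations = []
    for context, name in corpus:
        if context in idioms and name != idioms[context][0]:
            expected, confidence = idioms[context]
            violations.append((context, name, expected, confidence))
    return violations


def fit_threshold(labeled):
    """Stage (ii), toy version: choose the confidence cutoff that classifies
    the most (confidence, is_real_issue) examples correctly."""
    candidates = sorted({confidence for confidence, _ in labeled})

    def num_correct(threshold):
        return sum((c >= threshold) == is_issue for c, is_issue in labeled)

    return max(candidates, key=num_correct)


# Hypothetical "big code" corpus: (usage context, identifier name) pairs.
corpus = (
    [("assertEqual.arg0", "expected")] * 9
    + [("assertEqual.arg0", "actual")]
    + [("range.arg0", "n")] * 3
    + [("range.arg0", "count")] * 2
)

# Hypothetical small labeled set: (idiom confidence, was it a real issue?).
labeled = [(0.95, True), (0.9, True), (0.75, False), (0.72, False)]

idioms = mine_idioms(corpus)
violations = find_violations(corpus, idioms)
threshold = fit_threshold(labeled)
reports = [v for v in violations if v[3] >= threshold]

for context, name, expected, confidence in reports:
    print(f"{context}: found '{name}', idiom suggests '{expected}' ({confidence:.0%})")
```

In this sketch, the unsupervised stage deliberately over-reports, and the small labeled set is used only to decide which reports to keep. That mirrors the division of labor described in the abstract, with a far simpler pattern language and classifier.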

Martin T. Vechev | Veselin Raychev | Jingxuan He | Cheng-Chun Lee
