Mining Idioms in the Wild

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs, or just have not got around to it. Detection of idiomatic code can also point to the need for new APIs. We share our experiences in mine idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper — Jezero — is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mine semantic idioms using nonparametric Bayesian methods. Our experiments with Jezero on Hack code shows a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective than a baseline that did not have the dataflow augmentation in being able to effectively find refactoring opportunities from unannotated legacy code.

[1]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[2]  Jian Pei,et al.  Mining API patterns as partial orders from source code: from usage scenarios to specifications , 2007, ESEC-FSE '07.

[3]  Koushik Sen,et al.  CodeHint: dynamic and interactive synthesis of code snippets , 2014, ICSE.

[4]  Yuanyuan Zhou,et al.  CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code , 2004, OSDI.

[5]  M. Wegman,et al.  Global value numbers and redundant computations , 1988, POPL '88.

[6]  Richard C. Waters,et al.  The Programmer's Apprentice: a research overview , 1988, Computer.

[7]  Dan Klein,et al.  Type-Based MCMC , 2010, HLT-NAACL.

[8]  Charles A. Sutton,et al.  Mining idioms from source code , 2014, SIGSOFT FSE.

[9]  Zhi Jin,et al.  Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree , 2020, 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[10]  Miryung Kim,et al.  Does Automated Refactoring Obviate Systematic Editing? , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[11]  Sumit Gulwani,et al.  FlashMeta: a framework for inductive program synthesis , 2015, OOPSLA.

[12]  Premkumar T. Devanbu,et al.  Mining Semantic Loop Idioms , 2018, IEEE Transactions on Software Engineering.

[13]  Chanchal Kumar Roy,et al.  NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[14]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[15]  Sumit Gulwani,et al.  Learning Syntactic Program Transformations from Examples , 2016, 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE).

[16]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[17]  Guy Van den Broeck,et al.  Active Inductive Logic Programming for Code Search , 2018, 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).

[18]  Phil Blunsom,et al.  Inducing Tree-Substitution Grammars , 2010, J. Mach. Learn. Res..

[19]  Sumit Gulwani,et al.  Feedback-driven semi-supervised synthesis of program transformations , 2020, Proc. ACM Program. Lang..

[20]  Rastislav Bodík,et al.  Jungloid mining: helping to navigate the API jungle , 2005, PLDI '05.

[21]  Maninder Singh,et al.  Software clone detection: A systematic review , 2013, Inf. Softw. Technol..

[22]  Miryung Kim,et al.  Lase: Locating and applying systematic edits by learning from examples , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[23]  Johannes Bader,et al.  Getafix: learning to fix bugs automatically , 2019, Proc. ACM Program. Lang..

[24]  Kathryn T. Stolee,et al.  Solving the Search for Source Code , 2014, ACM Trans. Softw. Eng. Methodol..

[25]  Matias Martinez,et al.  Fine-grained and accurate source code differencing , 2014, ASE.

[26]  Loris D'Antoni,et al.  Learning Quick Fixes from Code Repositories , 2018, SBES.

[27]  Miryung Kim,et al.  Systematic editing: generating program transformations from an example , 2011, PLDI '11.

[28]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[29]  Sumit Gulwani,et al.  On the fly synthesis of edit suggestions , 2019, Proc. ACM Program. Lang..

[30]  Armando Solar-Lezama,et al.  Semantic code search via equational reasoning , 2020, PLDI.

[31]  Jian Pei,et al.  MAPO: Mining and Recommending API Usage Patterns , 2009, ECOOP.

[32]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[33]  Hoan Anh Nguyen,et al.  Graph-based mining of multiple object usage patterns , 2009, ESEC/FSE '09.

[34]  Kai Chen,et al.  Mining succinct and high-coverage API usage patterns from source code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[35]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[36]  Yaniv David,et al.  Tracelet-based code search in executables , 2014, PLDI.

[37]  Rainer Koschke,et al.  Clone Detection Using Abstract Syntax Suffix Trees , 2006, 2006 13th Working Conference on Reverse Engineering.

[38]  Jeffrey Xu Yu,et al.  Matching dependence-related queries in the system dependence graph , 2010, ASE.

[39]  Steven P. Reiss,et al.  Semantics-based code search , 2009, 2009 IEEE 31st International Conference on Software Engineering.