The adverse effects of code duplication in machine learning models of code

The field of big code relies on mining large corpora of code to perform some learning task towards creating better tools for software engineers. A significant threat to this approach was recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.

[1]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.

[2]  Christian Bird,et al.  Deep learning type inference , 2018, ESEC/SIGSOFT FSE.

[3]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[4]  Baishakhi Ray,et al.  Cross-project code clones in GitHub , 2018, Empirical Software Engineering.

[5]  Premkumar T. Devanbu,et al.  On the naturalness of software , 2016, Commun. ACM.

[6]  Premkumar T. Devanbu,et al.  A Survey of Machine Learning for Big Code and Naturalness , 2017, ACM Comput. Surv..

[7]  Omer Levy,et al.  code2seq: Generating Sequences from Structured Representations of Code , 2018, ICLR.

[8]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[9]  Jing Li,et al.  The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies , 2010, 2010 Asia Pacific Software Engineering Conference.

[10]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[11]  Andreas Krause,et al.  Learning programs from noisy data , 2016, POPL.

[12]  Martin T. Vechev,et al.  PHOG: Probabilistic Model for Code , 2016, ICML.

[13]  Daniel Tarlow,et al.  Structured Generative Models of Natural Source Code , 2014, ICML.

[14]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..

[15]  Alvin Cheung,et al.  Summarizing Source Code using a Neural Attention Model , 2016, ACL.

[16]  Charles A. Sutton,et al.  A Convolutional Attention Network for Extreme Summarization of Source Code , 2016, ICML.

[17]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[18]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[19]  Rico Sennrich,et al.  A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation , 2017, IJCNLP.

[20]  Andreas Krause,et al.  Predicting Program Properties from "Big Code" , 2015, POPL.

[21]  Alvin Cheung,et al.  Mapping Language to Code in Programmatic Context , 2018, EMNLP.

[22]  Jan Vitek,et al.  DéjàVu: a map of code duplicates on GitHub , 2017, Proc. ACM Program. Lang..

[23]  Michael W. Godfrey,et al.  “Cloning considered harmful” considered harmful: patterns of cloning in software , 2008, Empirical Software Engineering.

[24]  Percy Liang,et al.  A Retrieve-and-Edit Framework for Predicting Structured Outputs , 2018, NeurIPS.

[25]  José Nelson Amaral,et al.  Syntax errors just aren't natural: improving error reporting with language models , 2014, MSR 2014.

[26]  Charles A. Sutton,et al.  Mining source code repositories at massive scale using language modeling , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[27]  Premkumar T. Devanbu,et al.  Are deep neural networks the best choice for modeling source code? , 2017, ESEC/SIGSOFT FSE.