Evaluating the evaluations of code recommender systems: A reality check

Researchers continue to develop new code recommender systems, such as method-call completion, code-snippet completion, and code search, but evaluating such systems accurately remains a challenge. We analyzed the current literature and found that most evaluations rely on artificial queries extracted from released code, which raises the question: do such evaluations reflect real-life usage? To answer this question, we capture 6,189 fine-grained development histories from real IDE interactions. We use them as ground truth and extract 7,157 real queries for a specific method-call recommender system. We compare the results for these real queries with those of different artificial evaluation strategies and check several assumptions that are repeatedly made in research but have never been empirically evaluated. We find that an evolving context, which is often observed in practice, has a major effect on the prediction quality of recommender systems but is not commonly reflected in artificial evaluations.
