Learning from Examples to Find Fully Qualified Names of API Elements in Code Snippets

Developers often reuse code snippets from online forums, such as Stack Overflow, to learn how to use the APIs of software frameworks and libraries. These snippets often contain ambiguous, undeclared external references, which make it difficult to learn and use the APIs correctly. In particular, reusing such snippets requires significant manual effort and expertise, because resolving the fully qualified names (FQNs) of API elements by hand is a non-trivial task. In this paper, we propose a novel context-sensitive technique, called COSTER, to resolve the FQNs of API elements in such code snippets. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs, calculates likelihood scores, and builds an occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER captures the context of the query API element, matches it against the FQNs of API elements stored in the OLD, and ranks the matched FQNs using three different scores: likelihood, context similarity, and name similarity. An evaluation with more than 600K code examples collected from GitHub and two different Stack Overflow datasets shows that the proposed technique improves precision by 4-6% and recall by 3-22% compared to state-of-the-art techniques. It also significantly reduces training time compared to StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses of the results demonstrate the robustness of the proposed technique.
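
The pipeline described above (collect context, build a dictionary, rank candidates with three scores) can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the COSTER implementation: the function names `build_old` and `rank_fqns`, the form of the likelihood (fraction of training examples of an FQN in which a token appears), the Jaccard overlap used for context similarity, the `difflib` ratio used for name similarity, the equal weights, and the three training examples are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def build_old(examples):
    """Build a toy occurrence likelihood dictionary (assumed form).

    examples: iterable of (context_tokens, fqn) pairs from code whose
    FQNs are already known. Returns {fqn: {token: likelihood}}, where
    likelihood is the fraction of that FQN's examples containing the token.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for tokens, fqn in examples:
        counts[fqn].update(set(tokens))  # count each token once per example
        totals[fqn] += 1
    return {
        fqn: {tok: c / totals[fqn] for tok, c in toks.items()}
        for fqn, toks in counts.items()
    }


def rank_fqns(old, query_tokens, query_name, weights=(1.0, 1.0, 1.0)):
    """Rank candidate FQNs for a query API element.

    Combines three illustrative scores: likelihood, context similarity
    (Jaccard overlap), and name similarity (difflib ratio). The weights
    are hypothetical and equal by default.
    """
    query = set(query_tokens)
    if not query:
        return []
    w_like, w_ctx, w_name = weights
    scored = []
    for fqn, likelihoods in old.items():
        known = set(likelihoods)
        shared = query & known
        if not shared:
            continue  # no context match, so not a candidate
        likelihood = sum(likelihoods[t] for t in shared) / len(query)
        context_sim = len(shared) / len(query | known)
        simple_name = fqn.rsplit(".", 1)[-1]
        name_sim = SequenceMatcher(None, query_name, simple_name).ratio()
        score = w_like * likelihood + w_ctx * context_sim + w_name * name_sim
        scored.append((score, fqn))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [fqn for _, fqn in scored]


# Made-up training examples: two classes share the simple name "Date".
examples = [
    (["new", "Date", "getTime", "long"], "java.util.Date"),
    (["new", "Date", "valueOf", "PreparedStatement", "setDate"], "java.sql.Date"),
    (["Date", "valueOf", "ResultSet"], "java.sql.Date"),
]
old = build_old(examples)
ranking = rank_fqns(old, ["Date", "valueOf", "PreparedStatement"], "Date")
print(ranking)
```

In this toy query both candidates have the same simple name, so name similarity cannot separate them and the ordering is decided by the surrounding context tokens, which is the situation the abstract describes as ambiguous.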
