Statistical Translation of English Texts to API Code Templates

We develop T2API, a context-sensitive, graph-based statistical translation approach that takes as input an English description of a programming task and synthesizes the corresponding API code template for the task. We train T2API to statistically learn the alignments between English words and API elements and to determine the relevant API elements. The training is done on StackOverflow, a bilingual corpus in which developers discuss programming problems in two kinds of language: English and a programming language. T2API considers both the context of the words in the input query and the context of API elements that often go together in the corpus. The derived API elements, with their relevance scores, are assembled into an API usage by GraSyn, a novel graph-based API synthesis algorithm that generates a graph representing an API usage from a large code corpus. Importantly, GraSyn is capable of generating new API usages from previously seen sub-usages. We curate a test benchmark of 250 real-world StackOverflow posts. Across the benchmark, T2API's synthesized snippets have the correct API elements with a median top-1 precision and recall of 67% and 100%, respectively. Four professional developers and five graduate students judged that 77% of our top synthesized API code templates are useful for solving the problem presented in the StackOverflow posts.
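To make the idea of statistically learning alignments between English words and API elements concrete, the following is a minimal sketch in the style of IBM Model 1 expectation-maximization. It is not T2API's actual model, which is context-sensitive and is not specified in this abstract. The toy corpus `PAIRS`, the API element names, and the helpers `train_alignment` and `rank_apis` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy "bilingual" corpus: (English words, API elements) per post.
PAIRS = [
    (["open", "file"], ["FileReader.new"]),
    (["read", "line"], ["BufferedReader.readLine"]),
    (["open", "file", "read", "line"],
     ["FileReader.new", "BufferedReader.readLine"]),
]


def train_alignment(pairs, iterations=10):
    """Estimate t[(api, word)] = P(api | word) with IBM Model 1 style EM.

    Simplification (assumption): no NULL word and no context modeling.
    """
    apis = {a for _, api_elems in pairs for a in api_elems}
    uniform = 1.0 / len(apis)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected (api, word) alignment counts
        total = defaultdict(float)  # expected counts per word
        for words, api_elems in pairs:
            for a in api_elems:
                # E-step: spread one unit of mass for `a` over the words
                # of the same post, proportionally to the current table.
                norm = sum(t[(a, w)] for w in words)
                for w in words:
                    frac = t[(a, w)] / norm
                    count[(a, w)] += frac
                    total[w] += frac
        # M-step: renormalize the expected counts per word.
        t = defaultdict(
            float, {(a, w): c / total[w] for (a, w), c in count.items()}
        )
    return t


def rank_apis(t, query_words):
    """Rank API elements by summed alignment probability (hypothetical score)."""
    scores = defaultdict(float)
    for (api, word), prob in list(t.items()):
        if word in query_words:
            scores[api] += prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    table = train_alignment(PAIRS)
    for api, score in rank_apis(table, ["read", "line"]):
        print(f"{api}\t{score:.3f}")
```

In this sketch the ranked list plays the role of the "relevant API elements with relevance scores" described above; assembling those elements into a usage graph, as GraSyn does, is outside its scope.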
