Parameter-free probabilistic API mining across GitHub

Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.

[1]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[2]  Gabriele Bavota,et al.  How Can I Use This Method? , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[3]  R. Holmes,et al.  Using structural context to recommend source code examples , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[4]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.

[5]  Jian Pei,et al.  MAPO: mining API usages from open source repositories , 2006, MSR '06.

[6]  Westley Weimer,et al.  Synthesizing API usage examples , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[7]  Takeaki Uno,et al.  Frequent Pattern Mining , 2016, Encyclopedia of Algorithms.

[8]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[9]  Charles A. Sutton,et al.  Learning natural coding conventions , 2014, SIGSOFT FSE.

[10]  Zhenmin Li,et al.  PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code , 2005, ESEC/FSE-13.

[11]  Benjamin Livshits,et al.  DynaMine: finding common error patterns by mining software revision histories , 2005, ESEC/FSE-13.

[12]  Charles A. Sutton,et al.  Mining source code repositories at massive scale using language modeling , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[13]  Charles A. Sutton,et al.  A Subsequence Interleaving Model for Sequential Pattern Mining , 2016, KDD.

[14]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[15]  Hridesh Rajan,et al.  A study of repetitiveness of code changes in software evolution , 2013, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[16]  Martin P. Robillard,et al.  What Makes APIs Hard to Learn? Answers from Developers , 2009, IEEE Software.

[17]  Martin P. Robillard,et al.  Temporal analysis of API usage concepts , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[18]  Jilles Vreeken,et al.  The long and the short of it: summarising event sequences with serial episodes , 2012, KDD.

[19]  Koushik Sen,et al.  SNIFF: A Search Engine for Java Using Free-Form Queries , 2009, FASE.

[20]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[21]  Frank Maurer,et al.  What makes a good code example?: A study of programming Q&A in StackOverflow , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[22]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.

[23]  Hridesh Rajan,et al.  Mining preconditions of APIs in large-scale code corpus , 2014, FSE 2014.

[24]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[25]  Kai Chen,et al.  Mining succinct and high-coverage API usage patterns from source code , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[26]  Hridesh Rajan,et al.  Boa: A language and infrastructure for analyzing ultra-large-scale software repositories , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[27]  Martin P. Robillard,et al.  How API Documentation Fails , 2015, IEEE Software.

[28]  Gail C. Murphy,et al.  Using structural context to recommend source code examples , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[29]  Premkumar T. Devanbu,et al.  On the naturalness of software , 2016, Commun. ACM.

[30]  Ying Zou,et al.  Spotting working code examples , 2014, ICSE.

[31]  Jian Pei,et al.  Mining API patterns as partial orders from source code: from usage scenarios to specifications , 2007, ESEC-FSE '07.

[32]  Martin P. Robillard,et al.  Selection and presentation practices for code example summarization , 2014, SIGSOFT FSE.

[33]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[34]  Jian Pei,et al.  MAPO: Mining and Recommending API Usage Patterns , 2009, ECOOP.

[35]  Jiawei Han,et al.  Frequent pattern mining: current status and future directions , 2007, Data Mining and Knowledge Discovery.

[36]  Anh Tuan Nguyen,et al.  A statistical semantic language model for source code , 2013, ESEC/FSE 2013.

[37]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[38]  Toon Calders,et al.  Mining Compressing Sequential Patterns , 2012, Stat. Anal. Data Min..

[39]  Hridesh Rajan,et al.  Mining billions of AST nodes to study actual and potential usage of Java language features , 2014, ICSE.

[40]  Gabriele Bavota,et al.  Mining StackOverflow to turn the IDE into a self-confident programming prompter , 2014, MSR 2014.

[41]  Martin P. Robillard,et al.  A field study of API learning obstacles , 2011, Empirical Software Engineering.

[42]  Sushil Krishna Bajracharya,et al.  Leveraging usage similarity for effective retrieval of examples in code repositories , 2010, FSE '10.

[43]  Nir Friedman,et al.  The Bayesian Structural EM Algorithm , 1998, UAI.

[44]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .