Mining Likely Analogical APIs Across Third-Party Libraries via Large-Scale Unsupervised API Semantics Embedding

Establishing API mappings between third-party libraries is a prerequisite for library migration tasks. Establishing these mappings manually is tedious because of the large number of APIs to be examined, so an automatic technique that builds a database of likely API mappings can significantly ease the task. Unfortunately, existing techniques fall short in one of two ways. Some adopt a supervised learning mechanism that requires already-ported or functionally similar applications across major programming languages or platforms, which are difficult to come by for an arbitrary pair of third-party libraries. Others cannot deal with the lexical gap between the API descriptions of different libraries. To overcome these limitations, we present an unsupervised deep-learning-based approach that embeds both API usage semantics and API description (name and documentation) semantics into a vector space for inferring likely analogical API mappings between libraries. Based on deep learning models trained on tens of millions of API call sequences, method names, and comments of 2.8 million methods from 135,127 GitHub projects, our approach significantly outperforms other deep learning and traditional information retrieval (IR) methods for inferring likely analogical APIs. We implement a proof-of-concept website (https://similarapi.appspot.com) that can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. A third-party analogical-API database of this scale has not been achieved before.
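As a rough illustration of the final inference step only, the sketch below ranks candidate APIs in a target library by cosine similarity to a source API's vector. It is not the approach's actual model: the vectors are hand-made toy values, the API names (`TargetLib.readFile` and so on) are hypothetical, and using plain cosine similarity over a single vector per API is an assumption made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_analogical(query_vec, candidates, top_k=3):
    """Return the top_k (name, score) pairs, most similar candidate first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embedding for one API of a source library (hypothetical values).
source_api = [0.9, 0.1, 0.0]

# Toy embeddings for APIs of a target library (hypothetical names and values).
target_library = {
    "TargetLib.readFile": [0.8, 0.2, 0.1],
    "TargetLib.sendRequest": [0.0, 0.1, 0.9],
    "TargetLib.parseJson": [0.1, 0.9, 0.2],
}

ranking = rank_analogical(source_api, target_library, top_k=2)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

In a real setting the vectors would come from the trained usage and description models rather than being written by hand, and how the two kinds of embedding are combined is left open here.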
