APIReal: an API recognition and linking approach for online developer forums

When discussing programming issues on social platforms (e.g., Stack Overflow, Twitter), developers often mention APIs in natural language text. Extracting these API mentions is a prerequisite for effectively indexing and searching API-related information in software engineering social content. The task involves two steps: 1) distinguishing API mentions from other English words (i.e., API recognition), and 2) disambiguating a recognized API mention to its unique fully qualified name (i.e., API linking). Software engineering social content lacks a consistent format for both API mentions and sentence writing. As a result, API recognition and linking have to deal with the inherent ambiguity of API mentions in informal text. This ambiguity has three sources: a common word can carry both an API sense and its normal sense (e.g., append, apply, and merge); the simple name of an API can map to several APIs of the same library or of different libraries; and different written forms of an API should be linked to the same API. In this paper, we propose a semi-supervised machine learning approach that exploits name synonyms and the rich semantic context of API mentions for API recognition in informal text. Based on the results of our API recognition approach, we further propose an API linking approach that leverages a set of domain-specific heuristics, including mention-mention similarity, scope filtering, and mention-entry similarity, to determine which API in the knowledge base a recognized API mention actually refers to. To evaluate our API recognition approach, we use 1205 API mentions of three libraries (Pandas, NumPy, and Matplotlib) from Stack Overflow text. We also evaluate our API linking approach with 120 recognized API mentions of these three libraries.
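As a rough illustration of the linking step only, the following Python sketch resolves an already recognized mention to a fully qualified name using two of the heuristics named above: scope filtering and mention-entry similarity. It is not the approach evaluated in the paper. The toy knowledge base, its descriptions, the use of post tags as the scope signal, the Jaccard token overlap, and all function names (`simple_name`, `scope_filter`, `link`) are assumptions made for this example, and mention-mention similarity is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiEntry:
    fqn: str          # fully qualified name
    library: str      # library the API belongs to
    description: str  # short text used for mention-entry similarity


# Toy knowledge base; descriptions are illustrative, not official documentation.
KNOWLEDGE_BASE = [
    ApiEntry("matplotlib.pyplot.plot", "matplotlib",
             "plot y versus x as lines and markers"),
    ApiEntry("pandas.DataFrame.plot", "pandas",
             "make plots of a dataframe or series"),
    ApiEntry("pandas.DataFrame.merge", "pandas",
             "merge dataframe objects with a database-style join"),
]


def simple_name(name: str) -> str:
    """Reduce 'df.plot', 'plot()' or 'pandas.DataFrame.plot' to 'plot'."""
    return name.rsplit(".", 1)[-1].rstrip("()").lower()


def tokens(text: str) -> set:
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scope_filter(candidates, libraries_in_scope):
    """Keep candidates from libraries in scope; fall back to all if none match."""
    kept = [c for c in candidates if c.library in libraries_in_scope]
    return kept or candidates


def link(mention: str, context: str, libraries_in_scope: set):
    """Return the best-matching fully qualified name, or None if no candidate."""
    name = simple_name(mention)
    candidates = [e for e in KNOWLEDGE_BASE if simple_name(e.fqn) == name]
    if not candidates:
        return None
    candidates = scope_filter(candidates, libraries_in_scope)
    ctx = tokens(context)
    # Mention-entry similarity: token overlap between context and description.
    best = max(candidates, key=lambda e: jaccard(ctx, tokens(e.description)))
    return best.fqn
```

For instance, `link("df.plot", "make plots of my dataframe columns", {"pandas", "matplotlib"})` keeps both `plot` candidates after scope filtering and then prefers the Pandas entry because its description overlaps more with the context. Note that the sketch assumes recognition has already decided the mention is an API; it makes no attempt to separate the API sense of a word like "merge" from its ordinary sense.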
