Know-How in Programming Tasks: From Textual Tutorials to Task-Oriented Knowledge Graph

Accomplishing a programming task usually involves performing multiple activities in a logical order. Task-solving activities may have different relationships, such as subactivityof and precede-follow, and different attributes, such as location, condition, API, and code. We refer to task-solving activities, together with their relationships and attributes, as know-how knowledge. Know-how knowledge for programming tasks is commonly documented in semi-structured textual tutorials. A formative study of the 20 top-viewed Android-tagged how-to questions on Stack Overflow suggests that developers face three information barriers (incoherent modeling of task intent, tutorial information overload, and unstructured task activity descriptions) to effectively discovering and understanding task-solving knowledge in textual tutorials. Knowledge graphs have been shown to be effective in representing relational knowledge and supporting knowledge search in a structured way. Unfortunately, existing knowledge graphs extract only know-what information (e.g., APIs, API caveats, and API dependencies) from software documentation. In this paper, we devise open information extraction (OpenIE) techniques to extract candidates for task activities, activity attributes, and activity relationships from programming task tutorials. The resulting knowledge graph, TaskKG, includes a hierarchical taxonomy of activities, three types of activity relationships, and five types of activity attributes, and enables activity-centric knowledge search. As a proof of concept, we apply our approach to the Android Developer Guide. A comprehensive evaluation of TaskKG shows high accuracy of our OpenIE techniques. A user study shows that TaskKG is promising in helping developers find correct answers to programming how-to questions.
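To make the notion of know-how knowledge concrete, the following is a minimal sketch of how activities, their attributes, and their relationships might be held in a small graph structure. It is an illustration only and not the TaskKG schema or implementation: the class names `Activity` and `TaskGraph`, their methods, and the example activities are all hypothetical, and only the relationship names (subactivityof, precede-follow) and the attribute kinds (location, condition) are taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Activity:
    """One task-solving activity (hypothetical schema, not TaskKG's own)."""
    name: str
    # Attribute kinds named in the abstract include location, condition, API, code.
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class TaskGraph:
    """Activities plus typed edges stored as (source, relation, target) triples."""
    activities: Dict[str, Activity] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_activity(self, name: str, **attributes: str) -> None:
        self.activities[name] = Activity(name, dict(attributes))

    def relate(self, source: str, relation: str, target: str) -> None:
        # Refuse edges that mention activities the graph does not know about.
        if source not in self.activities or target not in self.activities:
            raise KeyError("both activities must be added before relating them")
        self.edges.append((source, relation, target))

    def subactivities(self, parent: str) -> List[str]:
        # Activities that are direct sub-steps of `parent`.
        return [s for s, r, t in self.edges if r == "subactivityof" and t == parent]

    def next_steps(self, name: str) -> List[str]:
        # Activities that directly follow `name`.
        return [t for s, r, t in self.edges if r == "precede-follow" and s == name]


# Invented example data, not extracted from any tutorial.
graph = TaskGraph()
graph.add_activity("request a permission")
graph.add_activity("declare the permission", location="manifest file")
graph.add_activity("check the permission", condition="at runtime")
graph.relate("declare the permission", "subactivityof", "request a permission")
graph.relate("check the permission", "subactivityof", "request a permission")
graph.relate("declare the permission", "precede-follow", "check the permission")
```

An activity-centric search in this sketch amounts to looking up an activity by name and then following its edges, for example `graph.subactivities("request a permission")` to list its sub-steps or `graph.next_steps("declare the permission")` to find what comes next.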
