A novel approach for recommending semantically linkable issues in GitHub projects

Dear editor, GitHub is a web-based project hosting platform that was launched in 2008 and has become one of the premier open-source development sites [1]. During the development of GitHub projects, issue reports, an important form of development knowledge, are often related because they contain relevant information. One manifestation is that many open issues in a project are linked to related issues by URL references. For a given issue, we refer to its related issues as linkable issues, or L-issues. Informing developers about L-issues can help them find similar bugs, useful resources, and critical information, all of which help resolve issues quickly and efficiently. This is especially true for newcomers who have just begun participating in the development process: identifying L-issues can help them gain insight into the relationships between known issues, comprehend development requirements, and avoid duplicate work. Unfortunately, information regarding L-issues is not readily available, and developers often miss them during issue resolution. In large projects especially, developers might have to investigate a large number of issues and connect the description in an issue report to its L-issues. This manual linking process costs time and effort, the amount depending on the experience and knowledge of the developers. GitHub provides an issue search engine, but when we tried it we found that it does not allow one to conveniently find L-issues; often, many of the top results are not appropriate L-issues. Thus, an automated approach for locating L-issues is needed to help developers save costs and focus on the knowledge related to a particular issue.

Our approach. Our approach works by generating and comparing distributed vectors of different dimensions for the textual contents of issues.
First, we extracted the text information from all issue data. Then, for each issue report, we combined its title and description into a single document. Next, we preprocessed the issue documents in three steps: (1) extracting all the words from each issue document; (2) removing stop words, numbers, punctuation marks, and other non-alphabetic characters; and (3) applying the Lancaster stemmer to transform the remaining words to their root forms, which reduces the feature dimensions and unifies similar words into a common representation. Based on the preprocessed issue documents, we trained three vector representation models:
• Term frequency-inverse document frequency (TF-IDF) model. This is the basic model, with which we wish to capture the relationship of word frequency between two issue documents. TF-IDF is one of the most popular information retrieval techniques. The main idea of TF-IDF is that if a term appears
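
The preprocessing steps and the TF-IDF baseline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the tiny `STOP_WORDS` list, the function names, the sample issues, the particular TF-IDF weighting (relative term frequency times log inverse document frequency), and the use of cosine similarity to compare vectors. Stemming is left as a pluggable `stem` argument that defaults to the identity function so that the sketch depends only on the standard library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the source does not give its list).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "when", "it"}


def preprocess(title, description, stem=lambda w: w):
    """Combine title and description, keep alphabetic tokens, drop stop words, stem."""
    document = f"{title} {description}".lower()
    tokens = re.findall(r"[a-z]+", document)  # drops numbers and punctuation
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed comparison measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query_index, vectors):
    """Rank all other issues by similarity to the query issue, best first."""
    scores = [
        (i, cosine(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical issue reports as (title, description) pairs.
issues = [
    ("Crash when saving file", "The app crashes on save with error 500."),
    ("Save button crashes app", "Clicking save crashes the editor."),
    ("Update documentation", "README install steps are outdated."),
]
docs = [preprocess(t, d) for t, d in issues]
vectors = tfidf_vectors(docs)
ranking = rank_candidates(0, vectors)  # candidate L-issues for the first issue
```

To approximate step (3), a stemmer such as NLTK's `LancasterStemmer().stem` could be passed as the `stem` argument, provided NLTK is installed. Without stemming, variants such as "crash" and "crashes" are treated as distinct terms, which shows why unifying similar words matters for this kind of frequency-based comparison.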

[1] Marco Tulio Valente, et al. An Empirical Study on Recommendations of Similar Bugs, 2016, 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[2] Quoc V. Le, et al. Distributed Representations of Sentences and Documents, 2014, ICML.

[3] Huaimin Wang, et al. Within-ecosystem issue linking: a large-scale study of rails, 2018, SoftwareMining@ASE.

[4] Jian Zhou, et al. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports, 2012, 2012 34th International Conference on Software Engineering (ICSE).

[5] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[6] David Lo, et al. Practitioners' expectations on automated fault localization, 2016, ISSTA.

[7] Gang Yin, et al. Social media in GitHub: the role of @-mention in assisting software development, 2015, Science China Information Sciences.