Finding relevant answers in software forums

Online software forums provide a large amount of valuable content. Developers and users often ask questions in such forums and receive answers there. The vast number of discussion threads in forums provides ample opportunities for knowledge acquisition and summarization. For a given search query, current search engines use traditional information retrieval approaches to extract webpages containing relevant keywords. However, in software forums, many threads often contain similar keywords, and a single thread can contain 1,000 posts or more. Manually finding relevant answers in these long threads is a painstaking task for users. Finding relevant answers is particularly hard in software forums for two reasons: the complexity of software systems causes a huge variety of issues that are often expressed in similar technical jargon, and software forum users are often expert internet users who post in multiple venues, creating many duplicate posts across the web, often without satisfying answers. To address this problem, this paper provides a semantic search engine framework that processes software threads and recovers relevant answers according to user queries. Unlike a standard information retrieval engine, our framework infers semantic tags for posts in software forum threads and uses these tags to recover relevant answer posts. In our case study, we analyze 6,068 posts from three software forums. For the accuracy of our inferred tags, we achieve on average an overall precision, recall, and F-measure of 67%, 71%, and 69%, respectively. To empirically study the benefit of our overall framework, we also conduct a user-assisted study, which shows that, compared to a standard information retrieval approach, our proposed framework increases mean average precision from 17% to 71% in retrieving relevant answers to various queries and achieves Normalized Discounted Cumulative Gain (nDCG) scores of 91.2% at rank 1 (nDCG@1) and 71.6% at rank 2 (nDCG@2).
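The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the paper's evaluation code: the function names are hypothetical, average precision is normalized by the number of relevant items found in the ranked list (assuming every relevant post appears somewhere in that list), and nDCG uses the common log2 rank discount, which may differ from the exact variant used in the study.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 form of the F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(relevance):
    """Average precision for one ranked list of binary labels (1 = relevant).

    Assumption: all relevant items appear in the list, so we divide by the
    number of relevant items found in it.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k results (log2 discount)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The reported precision (67%) and recall (71%) are consistent with
    # the reported F-measure of about 69%.
    print(round(f_measure(0.67, 0.71), 2))
```

For example, `ndcg_at_k([3, 2, 1], 2)` evaluates a ranking whose graded relevance scores are already in ideal order, so it returns 1.0, while a ranking that places a non-relevant post first scores 0 at rank 1.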
