Semi-automatically extracting FAQs to improve accessibility of software development knowledge

Frequently asked questions (FAQs) are a popular way to document software development knowledge. As creating such documents is expensive, this paper presents an approach for automatically extracting FAQs from sources of software development discussion, such as mailing lists and Internet forums, by combining techniques of text mining and natural language processing. We apply the approach to popular mailing lists and carry out a survey among software developers to show that it is able to extract high-quality FAQs that may be further improved by experts.
