Can Duplicate Questions on Stack Overflow Benefit the Software Development Community?

Duplicate questions on Stack Overflow are questions flagged as conceptually equivalent to a previously posted question. Stack Overflow's policy is that users should not discuss duplicate questions, and that attention should instead be redirected to the previously posted counterparts. Roughly 53% of closed Stack Overflow posts are closed due to duplication. Although their content supposedly overlaps with that of the originals, user activity suggests that duplicates may generate additional or superior answers. Approximately 9% of duplicates receive more views than their original counterparts despite being closed. In this paper, we analyze duplicate questions from two perspectives. First, we analyze the experience of those who post duplicates using activity- and reputation-based heuristics. Second, we compare the content of duplicates, in terms of both their questions and their answers, to determine the degree of similarity between each duplicate pair. Through analysis of the MSR challenge dataset, we find that although duplicate questions are more likely to be created by inexperienced users, they often receive answers that differ from those given to their original counterparts. Indeed, supplementary textual analysis using Natural Language Processing (NLP) techniques suggests that duplicate questions provide additional information about the underlying concepts being discussed. We recommend that Stack Overflow's duplication policy be revised to account for the benefits that leaving duplicate questions open may have for the developer community.
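
To make the idea of a "degree of similarity between each duplicate pair" concrete, the following is a minimal sketch of one common way to score textual overlap between two posts: cosine similarity over term-frequency vectors. The abstract does not state which similarity measure or NLP pipeline was used, so the measure, the tokenizer, and the function names `tokenize` and `cosine_similarity` are all illustrative assumptions, not the authors' method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts.

    Returns a value between 0.0 (no shared terms) and 1.0 (same term
    distribution). This is an assumed, illustrative measure.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the terms that appear in both texts.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction; treat it as sharing nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this sketch, a duplicate whose answers score low against the original's answers would count as contributing dissimilar content, while a score near 1.0 would indicate largely overlapping wording. A term-frequency measure only captures lexical overlap, so two answers that express the same idea in different words would still score low.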
