1 Duplicate Detection in ProgrammingQuestion Answering Communities

Community based Question Answering (CQA) websites are attracting increasing number of users and contributors in recent years. However, duplicate questions frequently occur in CQA websites and are currently manually identified by the moderators. Automatic duplicate detection, on one hand will alleviate this laborious effort from moderators before taking close actions, on the other hand helps question issuers quickly find answers. A number of studies have looked into related problems, however, very limited works target the duplicate detection in Programming CQA (PCQA), a branch of CQA that is dedicated to programmers. Existing works framed the task as a supervised learning problem on the question pairs and relied on only textual features. Moreover, the issue of selecting candidate duplicates from large volume of historical questions is often un-addressed. To tackle these issues, we model the duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions according to their similarities to the new issued question and select the top ranked ones as candidates to reduce the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics on question pairs, leveraging techniques in deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well; in some cases up to 25% improvement compared to the state-of-the-art benchmarks.

[1]  Strother H. Walker,et al.  Estimation of the probability of an event as a function of several independent variables. , 1967, Biometrika.

[2]  G. Golub,et al.  Updating formulae and a pairwise algorithm for computing sample variances , 1979 .

[3]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[4]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[5]  N. Altman An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression , 1992 .

[6]  Stephen E. Robertson,et al.  GatfordCentre for Interactive Systems ResearchDepartment of Information , 1996 .

[7]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[8]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[10]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[11]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[12]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[13]  Tong Zhang,et al.  Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[14]  CHENGXIANG ZHAI,et al.  A study of smoothing methods for language models applied to information retrieval , 2004, TOIS.

[15]  Hermann Ney,et al.  The Alignment Template Approach to Statistical Machine Translation , 2004, CL.

[16]  Tat-Seng Chua,et al.  Paraphrase Recognition via Dissimilarity Significance Classification , 2006, EMNLP.

[17]  Carlo Strapparava,et al.  Corpus-based and Knowledge-based Measures of Text Semantic Similarity , 2006, AAAI.

[18]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[19]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[20]  Kai Wang,et al.  A syntactic tree matching approach to finding similar questions in community-based qa services , 2009, SIGIR.

[21]  Éric Gaussier,et al.  Information-based models for ad hoc IR , 2010, SIGIR '10.

[22]  Christian S. Jensen,et al.  A generalized framework of exploring category information for question retrieval in community question answer archives , 2010, WWW '10.

[23]  Yong Yu,et al.  Analyzing and Predicting Not-Answered Questions in Community-based Question Answering Services , 2011, AAAI.

[24]  Jeffrey Pennington,et al.  Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection , 2011, NIPS.

[25]  Christoph Treude,et al.  How do programmers ask and answer questions on the web?: NIER track , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[26]  Idan Szpektor,et al.  Learning from the past: answering new questions with past answers , 2012, WWW.

[27]  Christian S. Jensen,et al.  Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives , 2012, TOIS.

[28]  Nitin Madnani,et al.  Re-examining Machine Translation Metrics for Paraphrase Identification , 2012, NAACL.

[29]  Jacob Eisenstein,et al.  Discriminative Improvements to Distributional Sentence Similarity , 2013, EMNLP.

[30]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[31]  Fang Liu,et al.  Improving Question Retrieval in Community Question Answering Using World Knowledge , 2013, IJCAI.

[32]  Andrew Chou,et al.  Semantic Parsing on Freebase from Question-Answer Pairs , 2013, EMNLP.

[33]  Ashish Sureka,et al.  Chaff from the wheat: characterization and modeling of deleted questions on stack overflow , 2014, WWW.

[34]  Jonathan Berant,et al.  Semantic Parsing via Paraphrasing , 2014, ACL.

[35]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[36]  David Lo,et al.  Multi-Factor Duplicate Question Detection in Stack Overflow , 2015, Journal of Computer Science and Technology.

[37]  Ming Zhou,et al.  Answering Questions with Complex Semantic Constraints on Open Knowledge Bases , 2015, CIKM.

[38]  Jimmy J. Lin,et al.  Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks , 2015, EMNLP.

[39]  Chanchal Kumar Roy,et al.  Mining Duplicate Questions of Stack Overflow , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[40]  Timothy Baldwin,et al.  An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation , 2016, Rep4NLP@ACL.

[41]  Aixin Sun,et al.  Topic Modeling for Short Texts with Auxiliary Word Embeddings , 2016, SIGIR.

[42]  Quan Z. Sheng,et al.  Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules , 2017, WWW.