Feature Analysis for Duplicate Detection in Programming QA Communities

In community question answering (CQA), duplicate questions are questions that have already been asked and answered but are posted again. These questions add noise to CQA websites and make it harder for users to find answers efficiently. Programming CQA (PCQA), a branch of CQA devoted to programming-related questions, also suffers from this problem. Existing work on duplicate detection in PCQA websites frames the task as supervised learning over question pairs and relies on a number of features extracted from those pairs. However, it extracts only textual features and does not consider the source code in the questions, which is linguistically very different from natural language. Our work focuses on developing novel features for PCQA duplicate detection. We leverage continuous word vectors from the deep learning literature, probabilistic models from information retrieval, and association pairs mined from duplicate questions using machine translation. We provide an extensive empirical analysis of the performance of these features, and of their various combinations, using a range of learning models. Our work could benefit both research and practical applications that need to extract features from text that is not entirely natural language.
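To make the pairwise framing concrete, the following is a minimal sketch of how a question pair might be turned into a feature vector. It is not the authors' feature set: Okapi BM25 stands in here as one example of a probabilistic retrieval model, the word vectors are hand-written two-dimensional toys rather than trained embeddings, and the names `tokenize`, `bm25_score`, `mean_vector`, `cosine`, and `pair_features` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real questions mix prose and code)."""
    return text.lower().split()


def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of `doc` for `query`; every argument is a list of tokens."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(q1, q2, corpus, vectors, dim):
    """Two illustrative features for one question pair: BM25 and vector cosine."""
    t1, t2 = tokenize(q1), tokenize(q2)
    return [
        bm25_score(t1, t2, corpus),
        cosine(mean_vector(t1, vectors, dim), mean_vector(t2, vectors, dim)),
    ]


# Toy data, invented for this sketch only.
questions = [
    "how to reverse a list in python",
    "reverse a python list",
    "segmentation fault when freeing pointer in c",
]
corpus = [tokenize(q) for q in questions]
toy_vectors = {
    "reverse": [1.0, 0.0],
    "list": [0.9, 0.1],
    "python": [0.8, 0.2],
    "pointer": [0.0, 1.0],
    "segmentation": [0.1, 0.9],
}

feats_dup = pair_features(questions[0], questions[1], corpus, toy_vectors, 2)
feats_non = pair_features(questions[0], questions[2], corpus, toy_vectors, 2)
print("likely duplicate pair:", feats_dup)
print("unrelated pair:", feats_non)
```

Feature vectors like these, one per labeled question pair, are what a supervised classifier would then be trained on. Source code snippets would likely need their own tokenizer and possibly their own feature columns, which this sketch does not attempt.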
