Neural Duplicate Question Detection without Labeled Training Data

Supervised training of neural models for duplicate question detection in community question answering (CQA) requires large amounts of labeled question pairs, which can be costly to obtain. To minimize this cost, recent works have often turned to alternative methods, e.g., adversarial domain adaptation. In this work, we propose two novel methods: weak supervision using the title and body of a question, and the automatic generation of duplicate questions. We show that both achieve improved performance even though they require no labeled data. We compare popular training strategies and show that our proposed approaches are more effective in many cases because they can utilize larger amounts of data from CQA forums. Finally, we show that weak supervision with question title and body information is also an effective way to train CQA answer selection models without direct answer supervision.
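To make the weak-supervision idea concrete, below is a minimal sketch (not the authors' implementation) of how title/body pairs from a CQA dump could be turned into weakly labeled training data for a duplicate detection model. A question's title paired with its own body acts as a positive example, while its title paired with a randomly sampled other body acts as a negative example; the dictionary keys and the example data are assumptions for illustration.

```python
import random

def build_weak_supervision_pairs(questions, seed=0):
    """Create weakly labeled training pairs from a CQA dump.

    `questions` is a list of dicts with hypothetical keys
    "title" and "body". A question's own body serves as a
    positive (duplicate-like) partner for its title; the body
    of a randomly sampled other question serves as a negative.
    """
    rng = random.Random(seed)
    pairs = []
    for i, q in enumerate(questions):
        # Positive pair: title matched with its own body.
        pairs.append((q["title"], q["body"], 1))
        # Negative pair: title matched with a random other body
        # (sample an index j != i without rejection sampling).
        j = rng.randrange(len(questions) - 1)
        if j >= i:
            j += 1
        pairs.append((q["title"], questions[j]["body"], 0))
    rng.shuffle(pairs)
    return pairs

if __name__ == "__main__":
    # Hypothetical CQA dump entries for demonstration.
    dump = [
        {"title": "How do I undo the last git commit?",
         "body": "I committed the wrong files and want to revert..."},
        {"title": "Difference between lists and tuples?",
         "body": "In Python, when should I prefer a tuple over a list?"},
        {"title": "How to center a div with CSS?",
         "body": "I want a div horizontally centered in its parent..."},
    ]
    for title, text, label in build_weak_supervision_pairs(dump):
        print(label, "|", title[:40], "||", text[:40])
```

The resulting (text, text, label) triples can then be fed to any pairwise similarity model; no human-labeled duplicate pairs are needed, which is what allows this approach to scale to the full forum dump.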
