Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results

“Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field” (Pfeiffer and Hoffmann, 2009). Deep learning is now enormously popular, and the ability to reproduce results is a cornerstone of science. There is growing concern within the deep learning community about the reproducibility of published results. In this paper we present a number of controllable, yet unreported, effects that can substantially change the effectiveness of a sample model, and thus the reproducibility of its results. Through these environmental effects we show that the commonly held belief that distributing source code is all that is needed for reproducibility is mistaken: source code without a reproducible environment offers little guarantee that results can be reproduced. Moreover, the range of results produced by these effects can be larger than the majority of incremental improvements reported in the literature.
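One of the simplest "controllable, yet unreported" environmental effects of the kind discussed above is the random seed, which governs weight initialization and data shuffling. The toy sketch below (hypothetical code, not from the paper; the run is simulated with Gaussian noise rather than an actual neural network) illustrates the point: identical seeds reproduce a result exactly, while changing only the seed changes the reported score.

```python
import random

def simulate_run(seed: int) -> float:
    """Simulate a model evaluation whose outcome depends on the random seed.

    Stand-in for a real training run: the 'accuracy' here is noise around
    a fixed mean, mimicking seed-driven run-to-run variance.
    """
    rng = random.Random(seed)
    base_accuracy = 0.80
    # Seed-dependent perturbation, standing in for effects of weight
    # initialization order and data shuffling.
    noise = sum(rng.gauss(0, 0.02) for _ in range(10)) / 10
    return base_accuracy + noise

run_a = simulate_run(seed=42)
run_b = simulate_run(seed=42)
run_c = simulate_run(seed=7)

assert run_a == run_b  # same seed, same environment: identical result
assert run_a != run_c  # changing only the seed changes the result
```

In a real deep learning pipeline the same logic applies, but several seeds must be pinned at once (the language runtime's, the numerics library's, and the framework's), and GPU kernels may remain nondeterministic even then.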

[1] Meng Zhang, et al. Neural Network Methods for Natural Language Processing, 2017, Computational Linguistics.

[2] Craig MacDonald, et al. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge, 2016, ECIR.

[3] Jimmy J. Lin, et al. Experiments with Convolutional Neural Network Models for Answer Selection, 2017, SIGIR.

[4] Qinmin Hu, et al. Enhancing Recurrent Neural Networks with Positional Attention for Question Answering, 2017, SIGIR.

[5] Iryna Gurevych, et al. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks, 2017, ArXiv.

[6] W. Bruce Croft, et al. On the Benefit of Incorporating External Features in a Neural Architecture for Answer Sentence Selection, 2017, SIGIR.

[7] Tat-Seng Chua, et al. Question answering passage retrieval using dependency relations, 2005, SIGIR '05.

[8] Andrew Trotman, et al. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR), 2016, SIGIR Forum.

[9] Jimmy J. Lin, et al. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks, 2015, EMNLP.

[10] Noah A. Smith, et al. Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions, 2010, NAACL.

[11] Shuohang Wang, et al. A Compare-Aggregate Model for Matching Text Sequences, 2016, ICLR.

[12] Zhiguo Wang, et al. Sentence Similarity Learning by Lexical Decomposition and Composition, 2016, COLING.

[13] Jimmy J. Lin, et al. Exploring the Effectiveness of Convolutional Neural Networks for Answer Selection in End-to-End Question Answering, 2017, ArXiv.

[14] Zhiguo Wang, et al. FAQ-based Question Answering via Word Alignment, 2015, ArXiv.

[15] Jun Zhao, et al. Inner Attention based Recurrent Neural Networks for Answer Selection, 2016, ACL.

[16] Wenpeng Yin, et al. Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching, 2017, EACL.

[17] Jimmy J. Lin, et al. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement, 2016, NAACL.

[18] Alessandro Moschitti, et al. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks, 2015, SIGIR.

[19] Christopher D. Manning, et al. Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering, 2010, COLING.

[20] S. K. Park, et al. Random number generators: good ones are hard to find, 1988, CACM.

[21] Dan Roth, et al. Mapping Dependencies Trees: An Application to Question Answering, 2003.

[22] Alex M. Warren. Repeatability and Benefaction in Computer Systems Research — A Study and a Modest Proposal, 2015.

[23] Alessandro Moschitti, et al. Automatic Feature Engineering for Answer Selection and Extraction, 2013, EMNLP.

[24] Alex Fit-Florea, et al. Precision and Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs, 2011.

[25] Christian Collberg, et al. Measuring Reproducibility in Computer Systems Research, 2014.

[26] Di Wang, et al. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering, 2015, ACL.

[27] Philip Bachman, et al. Deep Reinforcement Learning that Matters, 2017, AAAI.

[28] Ming-Wei Chang, et al. Question Answering Using Enhanced Lexical Semantic Models, 2013, ACL.

[29] Gu-Yeon Wei, et al. Deep Learning for Computer Architects, 2017, Synthesis Lectures on Computer Architecture.

[30] Alistair Moffat, et al. Improvements that don't add up: ad-hoc retrieval results since 1998, 2009, CIKM.

[31] Bowen Zhou, et al. LSTM-based Deep Learning Models for non-factoid answer selection, 2015, ArXiv.

[32] Yi Yang, et al. WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015, EMNLP.

[33] Lei Yu, et al. Deep Learning for Answer Sentence Selection, 2014, ArXiv.

[34] Bowen Zhou, et al. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs, 2015, TACL.

[35] Chris Callison-Burch, et al. Answer Extraction as Sequence Tagging with Tree Edit Distance, 2013, NAACL.

[36] Thomas Pfeiffer, et al. Large-Scale Assessment of the Effect of Popularity on the Reliability of Research, 2009, PLoS ONE.

[37] Noah A. Smith, et al. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, 2007, EMNLP.

[38] Bowen Zhou, et al. Attentive Pooling Networks, 2016, ArXiv.

[39] Bowen Zhou, et al. Applying deep learning to answer selection: A study and an open task, 2015, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[40] Phil Blunsom, et al. Neural Variational Inference for Text Processing, 2015, ICML.

[41] M. Baker. 1,500 scientists lift the lid on reproducibility, 2016, Nature.

[42] W. Bruce Croft, et al. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model, 2016, CIKM.