Improving Question Answering for Reading Comprehension Tests by Combining Multiple Systems

Most work on reading comprehension question answering systems has focused on improving performance by adding complex natural language processing (NLP) components to such systems rather than by combining the output of multiple systems. Our paper empirically evaluates whether combining the outputs of seven such systems, submitted as final projects for a graduate-level class, can improve over the performance of any individual system. We present several analyses of our combination experiments, including performance bounds, the impact of tie-breaking methods and ensemble size on performance, and an error analysis. Our results, replicated on two different publicly available reading test corpora, demonstrate the utility of system combination via majority voting in our restricted-domain question answering task.
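
As an illustration of combination via majority voting, the following minimal Python sketch picks the answer that most systems agree on. It is an assumption-laden sketch and not the implementation evaluated here: the function name `majority_vote`, the representation of each system's answer as a sentence ID string, and the tie-breaking rule (prefer the tied answer from the system listed first) are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the answer chosen by the most systems.

    `answers` holds one answer per system, in a fixed system order.
    Tie-breaking here is an assumed policy for illustration only:
    among equally frequent answers, the one proposed by the
    earliest-listed system wins.
    """
    if not answers:
        raise ValueError("need at least one system answer")
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in system order so the first tied answer is returned.
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Hypothetical outputs of seven systems for one question
# (each string stands for the sentence a system selected).
votes = ["s3", "s1", "s3", "s5", "s1", "s3", "s2"]
print(majority_vote(votes))
```

Other tie-breaking policies, such as a random choice among the tied answers or deferring to the system with the best individual accuracy, could be swapped in by changing only the final loop.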