Duplicate Removal for Candidate Answer Sentences

In this paper, we describe the duplicate removal component of Infolab's question answering system, which contributed to CSAIL's entry in the TREC-15 Question Answering track. The goal of the Question Answering track is to provide short, succinct answers to English questions posed by users. For definition questions, the system is asked to retrieve new and relevant information, in the form of short sentences or fragments drawn from newswire text. Because many news articles overlap in content, we need to apply a duplicate removal step before presenting results to users. Here we present two approaches to duplicate removal. Our first approach uses the BLEU score, a metric commonly used for machine translation evaluation, as the similarity measure between sentences. Our second approach takes a list of candidate answers and clusters them using word-level edit distance as the similarity measure; the best answer from each cluster is chosen as its representative. We compare these two approaches and evaluate their relative performance on the duplicate detection task.
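To make the second approach concrete, the following is a minimal sketch of edit-distance-based duplicate removal. The normalization, the threshold value, and the greedy single-pass clustering are illustrative assumptions, not the paper's exact procedure; it assumes candidates arrive ranked best-first, so the first member of each cluster serves as its representative.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over word tokens, not characters."""
    ta, tb = a.split(), b.split()
    # Single-row dynamic programming table.
    dp = list(range(len(tb) + 1))
    for i in range(1, len(ta) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(tb) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete a word from ta
                        dp[j - 1] + 1,      # insert a word from tb
                        prev + (ta[i - 1] != tb[j - 1]))  # substitute
            prev = cur
    return dp[-1]


def remove_duplicates(candidates: list[str], threshold: float = 0.5) -> list[str]:
    """Greedy clustering sketch: a candidate is a duplicate if its
    normalized word-level edit distance to any kept representative
    falls below the threshold (0.5 is an assumed value).  Candidates
    are assumed sorted best-first, so the first answer seen for each
    cluster is kept as the representative."""
    representatives: list[str] = []
    for cand in candidates:
        is_duplicate = False
        for rep in representatives:
            dist = word_edit_distance(cand, rep)
            # Normalize by the longer sentence so the score is in [0, 1].
            norm = dist / max(len(cand.split()), len(rep.split()))
            if norm < threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            representatives.append(cand)
    return representatives
```

For example, near-paraphrases such as "the cat sat on the mat" and "the cat sat on a mat" differ by one word out of six (normalized distance ≈ 0.17), so only the higher-ranked of the two survives, while an unrelated sentence starts a new cluster.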