Building a Large Multilingual Test Collection from Comparable News Documents

We present a novel approach to constructing a large test collection for the evaluation of information retrieval systems. The approach relies on a collection of time-sensitive documents, such as news stories, a particular class of query topics relating to unexpected events, and a particularly strict definition of relevance. We have used it to construct a large multilingual test collection of news stories in German and Italian, and we plan to construct a similar collection in French. The document collection is based on roughly 100,000 news stories from the Swiss news agency SDA. We have developed a query set of 65 topics relating to unexpected world news events, and we have made relevance judgments for these queries over the document collection by restricting, for each query, the set of documents that could possibly be relevant. We have already used the multilingual test collection successfully in two evaluations: one of the word normalisation modules we have developed for German and Italian, and one of our SPIDER retrieval system on cross-language retrieval tasks, in which Italian documents are retrieved in response to queries entered in German.
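
The abstract does not say how the space of possibly relevant documents is limited. One plausible reading, given that the documents are dated news stories and the topics concern unexpected events, is that a story published before an event cannot report on it. The Python sketch below illustrates that reading only. The Story record, the candidate_stories function, and the window_days cutoff are hypothetical names introduced for illustration and are not taken from the collection itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Story:
    """Hypothetical record for one news story."""
    doc_id: str
    published: date
    language: str  # e.g. "de" or "it"


def candidate_stories(stories, event_date, window_days, language=None):
    """Return the stories that could possibly be relevant to one event.

    Assumption: stories published before the event are excluded, and
    coverage is taken to end `window_days` after it (a hypothetical
    cutoff, not a value from the source). If `language` is given, only
    stories in that language are kept.
    """
    last_day = event_date + timedelta(days=window_days)
    return [
        s for s in stories
        if event_date <= s.published <= last_day
        and (language is None or s.language == language)
    ]
```

Under this assumption, relevance judgments for a topic need to be made only over the stories returned for its event date, rather than over the full collection.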