Infrastructure support for evaluation as a service

How do we conduct large-scale community-wide evaluations for information retrieval if we are unable to distribute the document collection? This was the challenge we faced in organizing a task on searching tweets at the Text Retrieval Conference (TREC), since Twitter's terms of service forbid redistribution of tweets. Our solution, which we call "evaluation as a service", was to provide an API through which the collection can be accessed for completing the evaluation task. This paper describes the infrastructure underlying the service and its deployment at TREC 2013. We discuss the merits of the approach and potential applicability to other evaluation scenarios.