Building a web test collection using social media

Community Question Answering (CQA) platforms contain a large number of questions and associated answers. Answerers sometimes include URLs as part of the answers to provide further information. This paper describes a novel way of building a test collection for web search by exploiting the link information from this type of social media data. We propose to build the test collection by regarding CQA questions as queries and the associated linked web pages as relevant documents. To evaluate this approach, we collect approximately ten thousand CQA queries, whose answers contained links to ClueWeb09 documents after spam filtering. Experimental results using this collection show that the relative effectiveness between different retrieval models on the ClueWeb-CQA query set is consistent with that on the TREC Web Track query sets, confirming the reliability of our test collection. Further analysis shows that the large number of queries generated through this approach compensates for the sparse relevance judgments in determining significant differences.