Content Locality in Time-Ordered Document Collections

Using newswire data sources from the TREC corpus, we show that the distribution of relevant documents with respect to time can be decidedly non-uniform. For many TREC topics, the relevant documents cluster in time. We call this clustering content locality and provide a simple metric for measuring it in time-ordered document collections. Content locality measurements from two time-synchronized data sources show a marked positive correlation. Given this correlation, we show that knowledge of the distribution of content locality in one document source can provide a modest improvement in retrieval results in a companion, time-synchronized document source. While these results are preliminary, they illustrate the potential of using time as an additional feature in retrieval.