Efficient Time-Travel on Versioned Text Collections

The availability of versioned text collections such as the Internet Archive opens up opportunities for time-aware exploration of their contents. In this paper, we propose time-travel retrieval and ranking, which extends traditional keyword queries with a temporal context in which the query should be evaluated. More precisely, the query is evaluated over all states of the collection that existed during the temporal context. To support these queries, we make three key contributions: (i) defining extensions to well-known relevance models that take into account the temporal context of the query and the version history of documents, (ii) designing an immortal index over the full versioned text collection that avoids a blow-up in index size, and (iii) making the popular NRA algorithm for top-k query processing aware of the temporal context. We present a preliminary experimental analysis over the English Wikipedia revision history showing that the proposed techniques are both effective and efficient.
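
To make the notion of a time-travel query concrete, the following is a minimal sketch and not the method of the paper. It assumes that each document version carries a validity interval [begin, end), and it scores versions by raw term frequency as a stand-in for the temporal relevance models. The names Version and time_travel_search are hypothetical, and the linear scan takes the place of the immortal index and NRA-based top-k processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    doc_id: str
    begin: int  # first time point at which this version is valid (inclusive)
    end: int    # time point at which it is superseded (exclusive)
    text: str


def time_travel_search(versions, keyword, t_begin, t_end, k=10):
    """Return up to k (doc_id, begin, score) tuples for versions that
    were valid at some point in the temporal context [t_begin, t_end]."""
    hits = []
    for v in versions:
        # A version belongs to a collection state inside the temporal
        # context iff its validity interval overlaps [t_begin, t_end].
        if not (v.begin <= t_end and v.end > t_begin):
            continue
        # Assumption: raw term frequency stands in for the relevance model.
        score = v.text.lower().split().count(keyword.lower())
        if score > 0:
            hits.append((v.doc_id, v.begin, score))
    # Highest score first; ties broken by doc_id, then version start time.
    hits.sort(key=lambda h: (-h[2], h[0], h[1]))
    return hits[:k]


versions = [
    Version("d1", 0, 5, "archive of web pages"),
    Version("d1", 5, 10, "archive archive of versioned text"),
    Version("d2", 3, 8, "wikipedia revision history"),
]
results = time_travel_search(versions, "archive", 4, 6)
```

With the temporal context [4, 6], both versions of d1 overlap the context and match the keyword, so both are returned. Narrowing the context to [7, 9] leaves only the later version.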
