Index maintenance for time-travel text search

Time-travel text search extends standard text search with temporal predicates, so that users of web archives can easily retrieve document versions that are relevant to a given keyword query and that existed during a given time interval. Several index structures have been proposed to support time-travel text search efficiently. None of them, however, can easily be updated as the Web evolves and new document versions are added to the web archive. In this work, we describe a novel index structure that efficiently supports time-travel text search and can be maintained incrementally as new document versions are added to the web archive. Our solution uses a sharded index organization, bounds the number of spuriously read index entries per shard, and can be maintained using small in-memory buffers and append-only operations. We present experiments on two large-scale real-world datasets demonstrating that maintaining our index structure is an order of magnitude more efficient than periodically rebuilding one of the existing index structures, while query-processing performance is not adversely affected.
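To make the setting concrete, the following is a minimal Python sketch of a time-travel query over an inverted index that is maintained with an in-memory buffer and append-only flushes. It is an illustration only, not the index structure proposed in this work: it does no sharding and gives no bound on spuriously read entries. The names `Posting`, `ToyTimeTravelIndex`, `buffer_limit`, `add_version`, `flush`, and `query` are hypothetical, and validity intervals are assumed to be half-open integer ranges `[begin, end)`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Posting:
    """One index entry: a document version and its validity interval."""
    doc_id: str
    begin: int  # first time point at which the version exists (inclusive)
    end: int    # time point at which the version stops existing (exclusive)


@dataclass
class ToyTimeTravelIndex:
    """Hypothetical toy index: buffered writes, append-only segments."""
    buffer_limit: int = 2  # number of versions buffered before a flush
    buffer: Dict[str, List[Posting]] = field(default_factory=dict)
    segments: List[Dict[str, List[Posting]]] = field(default_factory=list)
    buffered: int = 0

    def add_version(self, doc_id: str, begin: int, end: int,
                    terms: List[str]) -> None:
        """Index a new document version under each of its distinct terms."""
        for term in set(terms):
            self.buffer.setdefault(term, []).append(
                Posting(doc_id, begin, end))
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        """Append the buffer as a new immutable segment; never rewrite old ones."""
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}
        self.buffered = 0

    def query(self, term: str, t_begin: int, t_end: int) -> List[str]:
        """Return versions containing `term` that overlap [t_begin, t_end)."""
        hits = []
        for segment in self.segments + [self.buffer]:
            for p in segment.get(term, []):
                # Postings that fail this test were read spuriously; the toy
                # scans them anyway, which is the cost a real design bounds.
                if p.begin < t_end and t_begin < p.end:
                    hits.append(p.doc_id)
        return hits


idx = ToyTimeTravelIndex(buffer_limit=2)
idx.add_version("d1@v1", 0, 10, ["web", "archive"])
idx.add_version("d1@v2", 10, 20, ["web"])       # triggers a flush
idx.add_version("d2@v1", 5, 15, ["archive"])    # stays in the buffer
print(idx.query("web", 8, 12))
print(idx.query("archive", 12, 14))
```

In this sketch every query scans all postings of a term, including those whose validity interval lies entirely outside the query interval. Limiting that wasted work while still permitting append-only maintenance is the problem the abstract describes.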
