Hybrid index maintenance for growing text collections

We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided.We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.

[1]  Hector Garcia-Molina,et al.  Incremental updates of inverted lists for text document retrieval , 1994, SIGMOD '94.

[2]  Alistair Moffat,et al.  Fast on-line index construction by geometric partitioning , 2005, CIKM '05.

[3]  Alistair Moffat,et al.  In Situ Generation of Compressed Inverted Files , 1995, J. Am. Soc. Inf. Sci..

[4]  Kotagiri Ramamohanarao,et al.  Inverted files versus signature files for text indexing , 1998, TODS.

[5]  Wann-Yun Shieh,et al.  A statistics-based approach to incrementally update inverted files , 2005, Inf. Process. Manag..

[6]  Jan O. Pedersen,et al.  Optimization for dynamic inverted index maintenance , 1989, SIGIR '90.

[7]  Charles L. A. Clarke,et al.  A Hybrid Approach to Index Maintenance in Dynamic Text Retrieval Systems , 2006, ECIR.

[8]  T. Chiueh,et al.  Eecient Real-time Index Updates in Text Retrieval Systems , 1999 .

[9]  Hector Garcia-Molina,et al.  Synthetic workload performance analysis of incremental updates , 1994, SIGIR '94.

[10]  Hugh E. Williams,et al.  In-memory hash tables for accumulating text vocabularies , 2001, Inf. Process. Lett..

[11]  Hugh E. Williams,et al.  In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for Text Retrieval Systems , 2004, ACSC.

[12]  Hugh E. Williams,et al.  Compression of inverted indexes For fast query evaluation , 2002, SIGIR '02.

[13]  Charles L. A. Clarke,et al.  Indexing time vs. query time: trade-offs in dynamic information retrieval systems , 2005, CIKM '05.

[14]  Justin Zobel,et al.  Efficient single-pass index construction for text databases , 2003, J. Assoc. Inf. Sci. Technol..

[15]  Z. K. Silagadze,et al.  Citations and the Zipf-Mandelbrot Law , 1999, Complex Syst..

[16]  Hugh E. Williams,et al.  Efficient online index maintenance for contiguous inverted lists , 2006, Inf. Process. Manag..

[17]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .