Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree

Previous on-line index maintenance strategies were designed mainly for document insertions and do not consider document deletions. In a truly dynamic search environment, however, documents may be added to and removed from the collection at any point in time. In this paper, we examine on-line index maintenance with support for instantaneous document deletions and insertions. We present a DBT Merge strategy that dynamically adjusts the sequence of sub-index merge operations during index construction. It offers better query processing performance than previous methods while providing an equivalent level of index maintenance performance when document insertions and deletions occur in parallel. Using experiments on 426 GB of web data, we demonstrate the efficiency of our method in practice, showing that on-line index construction for dynamic text collections can be performed almost as fast as for growing text collections.
