A statistics-based approach to incrementally update inverted files

Many information retrieval systems use the inverted file as indexing structure. The inverted file, however, requires inefficient reorganization when new documents are to be added to an existing collection. Most studies suggest dealing with this problem by sparing free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocate the spare space. This approach estimates the space requirements in an inverted file using only a little most recent statistical data on space usage and document update request rate. For best indexing speed and space efficiency, the amount of the spare space to be allocated is determined by adaptively balancing the trade-offs between reorganization reduction and space utilization. Experiment results show that the proposed space-sparing approach significantly avoids reorganization in updating an inverted file, and in the meantime, unused free space can be well controlled such that the file access speed is not affected.

[1]  Sheldon M. Ross,et al.  Stochastic Processes , 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.

[2]  Kotagiri Ramamohanarao,et al.  Inverted files versus signature files for text indexing , 1998, TODS.

[3]  Sheau-Dong Lang,et al.  Feature reduction and database maintenance in NETNEWS classification , 1999, Proceedings. IDEAS'99. International Database Engineering and Applications Symposium (Cat. No.PR00265).

[4]  Nick T. Thomopoulos,et al.  Applied Forecasting Methods , 1980 .

[5]  Todd King Dynamic data structures: theory and application , 1992 .

[6]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[7]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[8]  Kurt Mehlhorn,et al.  Dynamic data structures , 1979 .

[9]  Todd King Dynamic data structures , 1992 .

[10]  Li Fan,et al.  Web caching and Zipf-like distributions: evidence and implications , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[11]  W. Bruce Croft,et al.  Fast Incremental Indexing for Full-Text Information Retrieval , 1994, VLDB.

[12]  Hector Garcia-Molina,et al.  Synthetic workload performance analysis of incremental updates , 1994, SIGIR '94.

[13]  Hector Garcia-Molina,et al.  Incremental updates of inverted lists for text document retrieval , 1994, SIGMOD '94.

[14]  Hans-Peter Kriegel,et al.  The DC-tree: a fully dynamic index structure for data warehouses , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).