Inverted files versus signature files for text indexing

Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling signature files, and we demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than signature files can, but inverted files also require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
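For readers unfamiliar with the two structures being compared, the following is a minimal Python sketch of each, run on a toy collection. It is an illustration only, not the implementation evaluated in the paper: the function names, the 64-bit signature width, the choice of three hash bits per term, and the use of SHA-256 are all assumptions made for this example. The inverted file maps each term to the list of documents containing it. The signature file stores one superimposed bit-string per document, so a query must scan every signature and then check each candidate for false matches.

```python
import hashlib


def terms(text):
    """Split text into lowercase whitespace-delimited terms."""
    return text.lower().split()


def build_inverted(docs):
    """Inverted file: term -> sorted list of document ids containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for t in set(terms(text)):
            index.setdefault(t, []).append(doc_id)
    return index


def query_inverted(index, query):
    """Conjunctive query: intersect the lists of all query terms."""
    lists = [set(index.get(t, ())) for t in terms(query)]
    return sorted(set.intersection(*lists)) if lists else []


def term_bits(term, width=64, k=3):
    """Hash a term to a mask with at most k bits set (width and k are assumed values)."""
    mask = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % width)
    return mask


def build_signatures(docs, width=64, k=3):
    """Signature file: one bit-string per document, the OR of its term masks."""
    sigs = []
    for text in docs:
        sig = 0
        for t in terms(text):
            sig |= term_bits(t, width, k)
        sigs.append(sig)
    return sigs


def query_signatures(sigs, docs, query, width=64, k=3):
    """Scan all signatures, then verify candidates to discard false matches."""
    q = 0
    for t in terms(query):
        q |= term_bits(t, width, k)
    # A document can match only if every query bit is set in its signature.
    candidates = [i for i, s in enumerate(sigs) if s & q == q]
    wanted = set(terms(query))
    return [i for i in candidates if wanted <= set(terms(docs[i]))]


docs = [
    "inverted files index text",
    "signature files hash text",
    "compressed inverted files",
]
index = build_inverted(docs)
sigs = build_signatures(docs)
print(query_inverted(index, "inverted files"))
print(query_signatures(sigs, docs, "inverted files"))
```

Both queries return the same document ids. The difference lies in the work done: the inverted file touches only the lists for the query terms, while the signature file examines every document's signature and must then verify each candidate against the text.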
