Selection of equifrequent word fragments for information retrieval

Abstract: The design of programs for searching large document data bases is discussed, with emphasis on compression coding that uses word fragments as the basic language elements. An algorithm is described for determining a set of almost equifrequent fragments, and its efficiency is tested on a sample data base formed from the MARC tapes. A threshold frequency serves as the parameter whose value determines the number of distinct fragments. The selection algorithm gives some preference to the longest fragments and hence allows the data base to be coded compactly by concatenating non-overlapping fragments.
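The abstract does not spell out the selection procedure, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes a simple scheme: candidate fragments are examined from longest to shortest, a fragment is selected when its count in the not-yet-covered text reaches the threshold, and text covered by a selected fragment is removed before shorter fragments are counted. The names `select_fragments`, `strip_fragments`, and `encode`, the `max_len` parameter, and the single-character fallback are all hypothetical.

```python
from collections import Counter


def strip_fragments(segment, chosen, length):
    """Remove non-overlapping occurrences of chosen fragments (scanning
    left to right) and return the leftover pieces of the segment."""
    pieces = []
    start = 0  # beginning of the current uncovered piece
    i = 0
    while i + length <= len(segment):
        if segment[i:i + length] in chosen:
            if i > start:
                pieces.append(segment[start:i])
            i += length
            start = i
        else:
            i += 1
    if start < len(segment):
        pieces.append(segment[start:])
    return pieces


def select_fragments(words, threshold, max_len=4):
    """Select fragments, longest first, whose count in the still-uncovered
    text is at least `threshold` (an assumed, simplified rule)."""
    segments = list(words)  # text not yet covered by a selected fragment
    selected = set()
    for length in range(max_len, 1, -1):
        counts = Counter(
            seg[i:i + length]
            for seg in segments
            for i in range(len(seg) - length + 1)
        )
        chosen = {frag for frag, n in counts.items() if n >= threshold}
        selected |= chosen
        segments = [
            piece
            for seg in segments
            for piece in strip_fragments(seg, chosen, length)
        ]
    # Assumption: every single character is kept as a fallback fragment,
    # so any word over the sample alphabet can always be coded.
    selected |= {ch for word in words for ch in word}
    return selected


def encode(word, fragments):
    """Code a word as a concatenation of non-overlapping fragments,
    taking the longest matching fragment at each position."""
    longest = max(len(frag) for frag in fragments)
    out = []
    i = 0
    while i < len(word):
        for length in range(min(longest, len(word) - i), 0, -1):
            piece = word[i:i + length]
            if piece in fragments:
                out.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no fragment matches {word[i]!r}")
    return out
```

In this sketch, lowering `threshold` admits more fragments, which mirrors the role the abstract assigns to the threshold frequency. Removing covered text before counting shorter fragments is one simple way to favor long fragments and to keep the counts of the selected fragments closer together.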
