Optimal signature extraction and information loss

Signature files seem to be a promising access method for text and attributes. In this method, the documents (or records) are stored sequentially in one file (the "text file"), while abstractions of the documents ("signatures") are stored sequentially in another file (the "signature file"). To resolve a query, the signature file is scanned first, and many nonqualifying documents are rejected immediately. We develop a framework that includes primary key hashing, multiattribute hashing, and signature files. Our goal is to find the optimal signature extraction method. The main contribution of this paper is a set of optimal algorithms, together with efficient suboptimal ones, for assigning words to signatures in several environments. A second contribution is the use of information theory to study the relationship between the false drop probability Fd and the information that is lost during signature extraction. We give tight lower bounds on the achievable Fd and show that a simple relationship holds between the two quantities in the case of optimal signature extraction with uniform occurrence and query frequencies. We also examine hashing as a method of mapping words to signatures (instead of the optimal assignment) and show that the same relationship holds between Fd and the information loss, which suggests that an invariant may exist between these two quantities for every signature extraction method.
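
The two-file scheme described above can be illustrated with a minimal sketch. It assumes superimposed coding with a hash-based word-to-signature mapping, which is only one possible extraction method and not the optimal assignment studied in the paper. The signature width F, the number of bits per word M, and all function names are hypothetical choices made for illustration.

```python
import hashlib

F = 64  # signature width in bits (assumed value)
M = 3   # bits set per word (assumed value)


def word_signature(word: str, f: int = F, m: int = M) -> int:
    """Map a word to an f-bit pattern by hashing it m times (at most m bits set)."""
    sig = 0
    for i in range(m):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % f)
    return sig


def document_signature(text: str) -> int:
    """Superimpose (bitwise OR) the signatures of all words in the document."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def search(word, text_file, signature_file):
    """Scan the signature file first, then verify candidates against the text file."""
    q = word_signature(word)
    # A document qualifies only if every bit of the query signature is set.
    candidates = [i for i, s in enumerate(signature_file) if (s & q) == q]
    # Verification against the text removes false drops.
    hits = [i for i in candidates if word.lower() in text_file[i].lower().split()]
    false_drops = [i for i in candidates if i not in hits]
    return hits, false_drops


# The "text file" and its parallel "signature file".
text_file = [
    "signature files speed up text retrieval",
    "hashing maps words to buckets",
    "information theory bounds the false drop rate",
]
signature_file = [document_signature(t) for t in text_file]

hits, false_drops = search("hashing", text_file, signature_file)
print("hits:", hits, "false drops:", false_drops)
```

In this sketch a document containing the query word always passes the signature test, so the scan never misses a qualifying document. A nonqualifying document can still pass when its superimposed bits happen to cover the query signature; such a document is a false drop, and the fraction of them corresponds to the quantity Fd discussed in the abstract.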
