Signature files: design and performance comparison of some signature extraction methods

Signature files seem to be a promising method for text retrieval and document retrieval [29,5,8]. According to this method, the documents are stored sequentially in one file ("text file") while abstractions of the documents ("signatures") are stored sequentially in another file ("signature file"). In order to resolve a query, the signature file is scanned first and many non-qualifying documents are immediately rejected. In this paper we present three old and one new signature extraction methods and compare their screening capacities. We derive exact and approximate formulas for the false drop probability of each method and discuss the new method in more detail.
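The two-file scheme described above can be illustrated with a minimal sketch. It uses superimposed coding, a well-known way to build signatures, purely as an example; the abstract does not say which extraction methods the paper compares, so this is not presented as one of them. The signature width `SIG_BITS`, the number of bits set per word `BITS_PER_WORD`, the SHA-256-based hashing, and all function names are assumptions made for the illustration.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (assumed value)
BITS_PER_WORD = 3   # bit positions set per word (assumed value)


def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def document_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in a document."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def search(query_word, text_file, signature_file):
    """Scan the signature file first, then verify survivors against the text."""
    q = word_signature(query_word)
    hits = []
    for doc, sig in zip(text_file, signature_file):
        if sig & q != q:
            continue  # rejected from the signature alone; text is not read
        # The signature matched: check the text to rule out a false drop.
        if query_word.lower() in doc.lower().split():
            hits.append(doc)
    return hits


# "Text file" and "signature file" kept as two parallel sequences.
text_file = [
    "signature files for text retrieval",
    "string matching algorithms",
]
signature_file = [document_signature(doc) for doc in text_file]
```

Calling `search("retrieval", text_file, signature_file)` returns only the first document. A document whose signature matches the query but whose text does not contain the word is a false drop; in this sketch such a document costs one extra text comparison and is then discarded.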

[1]  M. D. McIlroy,et al.  Development of a Spelling List , 1982, IEEE Trans. Commun..

[2]  Stavros Christodoulakis,et al.  Message files , 1982, TOIS.

[3]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[4]  David C. van Voorhis,et al.  Optimal source codes for geometrically distributed integer alphabets (Corresp.) , 1975, IEEE Trans. Inf. Theory.

[5]  Christos Faloutsos,et al.  Signature files: an access method for documents and its analytical performance evaluation , 1984, TOIS.

[6]  S. Golomb Run-length encodings. , 1966 .

[7]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[8]  John L. Pfaltz,et al.  Partial-match retrieval using indexed descriptor files , 1980, Commun. ACM.

[9]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[10]  Robert T. Dattola FIRST: Flexible Information Retrieval System for Text , 1979, J. Am. Soc. Inf. Sci..

[11]  Chris M. Gravina National Westminster Bank Mass Storage Archiving , 1978, IBM Syst. J..

[12]  C.S. Roberts,et al.  Partial-match retrieval via the method of superimposed codes , 1979, Proceedings of the IEEE.

[13]  Richard A. Gustafson Elements of the randomized combinatorial file structure , 1971, SIGIR '71.

[14]  Calvin N. Mooers,et al.  Application of random codes to the gathering of statistical information , 1948 .

[15]  Roger L. Haskin,et al.  On extending the functions of a relational database system , 1982, SIGMOD '82.

[16]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[17]  Christos Faloutsos,et al.  Access methods for text , 1985, CSUR.

[18]  Christos Faloutsos,et al.  A Multimedia Office Filing System , 1983, VLDB.

[19]  Ian A. Macleod A data base management system for document retrieval applications , 1981, Inf. Syst..

[20]  Stavros Christodoulakis Framework for the Development of an Experimental Mixed-Mode Message System , 1984, SIGIR.

[21]  Christos Faloutsos,et al.  Design Considerations for a Message File Server , 1984, IEEE Transactions on Software Engineering.

[22]  Lee A. Hollaar,et al.  Text Retrieval Computers , 1979, Computer.

[23]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, Commun. ACM.

[24]  Fausto Rabitti,et al.  Evaluation of Access Methods to Text Document in Office Systems , 1984, SIGIR.

[25]  Simon Stiassny Mathematical analysis of various superimposed coding methods , 1960 .

[26]  R. M. Bird,et al.  Associative/parallel processors for searching very large textual data bases , 1977, CAW '77.