Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison

High-capacity disks, especially optical ones, are commercially available. These disks are ideal for archiving large text data bases. In this work, we examine efficient searching techniques for such applications. We propose a unifying framework that reveals the similarities between signature files and an inverted file that uses a hash table. We then design methods that combine the ease of insertion of signature files with the fast retrieval of inverted files. We develop analytical models for their performance and verify them through experimentation on a 2.8 Mb data base; the agreement between theory and experiment is very good. The results show that the proposed methods achieve fast retrieval and require a modest 10%-30% space overhead (as opposed to the 50%-300% overhead [13] of inverted files). They also do not require rewriting; thus, they handle insertions easily, permit searches during an insertion, and can be used with write-once optical disks. Using our verified model, the performance predictions for the proposed methods on large data bases (e.g., 250 Mb) are very promising.
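As background for the signature-file side of this comparison, the following is a minimal sketch of generic superimposed coding with an append-only signature list. It is not the method proposed in this work. The names `SIG_BITS`, `BITS_PER_WORD`, `word_mask`, `block_signature`, `may_contain`, `insert`, and `search`, the parameter values, and the choice of SHA-256 as the hash function are all assumptions made for illustration.

```python
import hashlib

SIG_BITS = 64       # signature width in bits (illustrative value, an assumption)
BITS_PER_WORD = 3   # bit positions hashed per word (illustrative value, an assumption)


def word_mask(word):
    """Return a SIG_BITS-wide mask with at most BITS_PER_WORD bits set for one word."""
    mask = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % SIG_BITS)
    return mask


def block_signature(text):
    """Superimpose (bitwise OR) the masks of all words in a block of text."""
    sig = 0
    for word in text.split():
        sig |= word_mask(word)
    return sig


def may_contain(sig, word):
    """True if every bit of the word's mask is set in the signature.

    A True result can be a false drop; a False result is always correct.
    """
    mask = word_mask(word)
    return sig & mask == mask


# Append-only list of signatures: inserting a document never rewrites
# earlier entries, which is the property that suits write-once media.
signatures = []


def insert(text):
    signatures.append(block_signature(text))


def search(word):
    """Return indices of candidate documents; candidates still need verification."""
    return [i for i, sig in enumerate(signatures) if may_contain(sig, word)]


insert("optical disks archive large text data bases")
insert("signature files and inverted files")
print(search("optical"))
```

In this sketch a search scans every signature sequentially, which is the retrieval cost that a hash-table-based inverted file avoids. Because superimposed coding can report false drops, each candidate returned by `search` would still have to be checked against the stored text.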

[1]  Thomas A. Standish. An Essay on Software Reuse, 1984, IEEE Transactions on Software Engineering.

[2]  Ben Shneiderman, et al. An Experimental Comparison of a Mouse and Arrow-Jump Keys for an Interactive Encyclopedia, 1986, Int. J. Man Mach. Stud.

[3]  Soon Myoung Chung, et al. Computer Architecture for a Surrogate File to a Very Large Data/Knowledge Base, 1987, Computer.

[4]  Kotagiri Ramamohanarao, et al. A Superimposed Codeword Indexing Scheme for Very Large Prolog Databases, 1986, ICLP.

[5]  John L. Pfaltz, et al. Partial-match retrieval using indexed descriptor files, 1980, CACM.

[6]  Gaston H. Gonnet, et al. Mind Your Grammar: a New Approach to Modelling Text, 1987, VLDB.

[7]  Craig Stanfill, et al. Parallel free-text search on the connection machine system, 1986, CACM.

[8]  Christos Faloutsos, et al. Signature files: an access method for documents and its analytical performance evaluation, 1984, TOIS.

[9]  Stavros Christodoulakis, et al. The multimedia object presentation manager of MINOS: a symmetric approach, 1986, SIGMOD '86.

[10]  Fausto Rabitti, et al. Evaluation of Access Methods to Text Document in Office Systems, 1984, SIGIR.

[11]  Simon Stiassny. Mathematical analysis of various superimposed coding methods, 1960.

[12]  Roger L. Haskin, et al. Special-Purpose Processors for Text Retrieval, 1981.

[13]  Christos Faloutsos, et al. Description and performance analysis of signature file methods for office filing, 1987, TOIS.

[14]  James L. Peterson, et al. Computer programs for detecting and correcting spelling errors, 1980, CACM.

[15]  Roger L. Haskin, et al. Architecture and Operation of a Large, Full-Text Information-Retrieval System, 1983, Advanced Database Machine Architecture.

[16]  Christos Faloutsos, et al. Design Considerations for a Message File Server, 1984, IEEE Transactions on Software Engineering.

[17]  Lee A. Hollaar, et al. Text Retrieval Computers, 1979, Computer.

[18]  Stavros Christodoulakis, et al. Message files, 1982, TOIS.

[19]  Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[20]  Kotagiri Ramamohanarao, et al. A two level superimposed coding scheme for partial match retrieval, 1983, Inf. Syst.

[21]  George R. Thoma, et al. A prototype system for the electronic storage and retrieval of document images, 1985, TOIS.

[22]  Christos Faloutsos, et al. Access methods for text, 1985, CSUR.

[23]  Gary D. Knott, et al. Expandable open addressing hash table storage and retrieval, 1971, SIGFIDET '71.

[24]  Larry Fujitani. Laser optical disk: the coming revolution in on-line storage, 1984, CACM.

[25]  Calvin N. Mooers, et al. Application of random codes to the gathering of statistical information, 1948.