Access methods for text

This paper compares text retrieval methods intended for office systems. The operational requirements of the office environment are discussed, and retrieval methods from database systems and from information retrieval systems are examined. We classify these methods and examine the most interesting representatives of each class. Attempts to speed up retrieval with special-purpose hardware are also presented, and issues such as approximate string matching and compression are discussed. A qualitative comparison of the examined methods is presented. The signature file method is discussed in more detail.
