Complete inverted files for efficient text retrieval and analysis

Given a finite set of texts S = {w_1, …, w_k} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
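
To make the semantics of the three query functions concrete, the following is a minimal brute-force sketch in Python. It is not the linear-size automaton described above: every query rescans the texts, so it serves only as a reference for what find, freq, and locations are expected to return. The class name `NaiveInvertedFile`, the encoding of a position as a (text index, start offset) pair, and the choice to count overlapping occurrences are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Set, Tuple


class NaiveInvertedFile:
    """Brute-force reference for the find/freq/locations semantics.

    Hypothetical helper, not the paper's data structure: each query
    scans every text instead of walking an annotated automaton.
    """

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)

    def locations(self, w: str) -> Set[Tuple[int, int]]:
        """Return (text index, start offset) pairs where w occurs.

        Overlapping occurrences are included (an assumption).
        """
        if not w:
            raise ValueError("w must be non-empty")
        found: Set[Tuple[int, int]] = set()
        for i, text in enumerate(self.texts):
            start = text.find(w)
            while start != -1:
                found.add((i, start))
                # Advance by one so overlapping matches are not skipped.
                start = text.find(w, start + 1)
        return found

    def freq(self, w: str) -> int:
        """Number of occurrences of w across all texts."""
        return len(self.locations(w))

    def find(self, w: str) -> str:
        """Longest prefix of w that is a subword of some text."""
        n = 0
        # If w[:n+1] occurs then so does w[:n], so growing n greedily is safe.
        while n < len(w) and any(w[: n + 1] in t for t in self.texts):
            n += 1
        return w[:n]


index = NaiveInvertedFile(["abcbc", "bca"])
print(index.find("bcax"))             # bca
print(index.freq("bc"))               # 3
print(sorted(index.locations("bc")))  # [(0, 1), (0, 3), (1, 0)]
```

With the texts "abcbc" and "bca", the word "bc" occurs twice in the first text and once in the second, which accounts for the frequency of 3. The structure described in the abstract answers the same queries without rescanning the texts.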
