Building a complete inverted file for a set of text files in linear time

Given a finite set of texts S = {ω₁, ..., ωₖ} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(ω), which returns the longest prefix of ω that occurs in S; freq(ω), which returns the number of times ω occurs in S; and locations(ω), which returns the set of positions at which ω occurs in S. We give a data structure that implements a complete inverted file for S, occupies linear space, and can be built in linear time under the uniform-cost RAM model. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, we use techniques from the theory of finite automata to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions.
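The three queries can be illustrated with a minimal sketch. It is not the construction of the paper: it builds a plain (uncompacted) suffix automaton over the texts joined by a separator character, which is assumed to occur in no text and no query. The names InvertedFile, SEP, and the (text index, offset) encoding of positions are hypothetical choices made for this illustration only.

```python
from bisect import bisect_right


class InvertedFile:
    """Illustrative suffix automaton over the texts joined by a separator."""

    SEP = "\x00"  # assumption: never occurs in any text or query

    def __init__(self, texts):
        # Global offset at which each text starts in the joined string.
        self.starts, pos = [], 0
        for t in texts:
            self.starts.append(pos)
            pos += len(t) + 1  # +1 for the separator
        # Parallel arrays, one entry per state; state 0 is the initial state.
        self.nxt, self.link, self.length = [{}], [-1], [0]
        self.cnt, self.end = [0], [None]
        last = 0
        for i, c in enumerate(self.SEP.join(texts)):
            cur = self._new(self.length[last] + 1, {}, -1, 1, i)
            p = last
            while p != -1 and c not in self.nxt[p]:
                self.nxt[p][c] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.nxt[p][c]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q: the clone takes over the shorter subwords.
                    clone = self._new(self.length[p] + 1, dict(self.nxt[q]),
                                      self.link[q], 0, None)
                    while p != -1 and self.nxt[p].get(c) == q:
                        self.nxt[p][c] = clone
                        p = self.link[p]
                    self.link[q] = self.link[cur] = clone
            last = cur
        # Annotate: push occurrence counts up the suffix-link tree
        # (longest states first) and record each state's children.
        self.kids = [[] for _ in self.nxt]
        for v in sorted(range(1, len(self.nxt)), key=lambda v: -self.length[v]):
            self.cnt[self.link[v]] += self.cnt[v]
            self.kids[self.link[v]].append(v)

    def _new(self, length, nxt, link, cnt, end):
        self.nxt.append(nxt)
        self.link.append(link)
        self.length.append(length)
        self.cnt.append(cnt)
        self.end.append(end)
        return len(self.nxt) - 1

    def _walk(self, w):
        """Follow w from the initial state; return (state, characters matched)."""
        state, matched = 0, 0
        for c in w:
            if c == self.SEP or c not in self.nxt[state]:
                break
            state = self.nxt[state][c]
            matched += 1
        return state, matched

    def find(self, w):
        """Longest prefix of w that occurs in some text."""
        return w[:self._walk(w)[1]]

    def freq(self, w):
        """Number of occurrences of a non-empty w over all texts."""
        state, matched = self._walk(w)
        return self.cnt[state] if w and matched == len(w) else 0

    def locations(self, w):
        """Set of (text index, offset) pairs at which a non-empty w occurs."""
        state, matched = self._walk(w)
        if not w or matched != len(w):
            return set()
        found, stack = set(), [state]
        while stack:
            v = stack.pop()
            if self.end[v] is not None:  # clones carry no end position
                start = self.end[v] - len(w) + 1
                k = bisect_right(self.starts, start) - 1
                found.add((k, start - self.starts[k]))
            stack.extend(self.kids[v])
        return found


index = InvertedFile(["abab", "ba"])
```

For the two texts above, index.find("abc") yields "ab", index.freq("b") yields 3, and index.locations("ba") yields {(0, 1), (1, 0)}. The sketch stores transitions in dictionaries and omits the compaction step described in the abstract, so it makes no claim about matching the stated space and time bounds.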
