Text Searching and Indexing

Although data is stored in various ways, text remains the main form of exchanging information. This is particularly evident in literature and linguistics, where data consists of huge corpora and dictionaries. It applies as well to computer science, where a large amount of data is stored in linear files, and to molecular biology, where biological molecules can often be approximated as sequences of nucleotides or amino acids. Moreover, the quantity of available data in these fields tends to double every 18 months. This is why algorithms must be efficient even though the speed of computers increases at a steady pace. Pattern matching is the problem of locating a specific pattern inside raw data. The pattern is usually a collection of strings described in some formal language. In this chapter we present several algorithms for solving the problem when the pattern is composed of a single string. In several applications, texts need to be structured before being searched. Even if no further information is known about their syntactic structure, it is possible and indeed extremely efficient to build a data structure that supports searches. For this purpose we present suffix arrays, suffix trees, suffix automata and compact suffix automata.
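
To make the idea of indexing a text before searching it concrete, here is a minimal sketch of a suffix array in Python. It is not one of the chapter's algorithms: the construction simply sorts all suffixes, which is far slower than a linear-time construction, and the function names build_suffix_array and find_occurrences are hypothetical, chosen only for this illustration. The search step is a plain binary search over the sorted suffixes.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order.

    Assumption: naive construction by sorting suffixes directly. This is simple
    but not linear-time; it is only meant to show what the index contains.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, using the suffix array sa."""
    m = len(pattern)

    # First binary search: leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] all begin with pattern; report positions in text order.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ana")
```

Once the array is built, each query costs a logarithmic number of string comparisons instead of a scan of the whole text, which is the trade-off that motivates building an index when the same text is searched many times.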
