Paper information - Suffix Arrays for Multiple Strings: A Method for On-Line Multiple String Searches

Suffix Arrays for Multiple Strings: A Method for On-Line Multiple String Searches

Manber and Myers' suffix array for a single string is a useful data structure for solving string matching problems. In this paper, we will show how to generalize their idea to multiple strings. We call this generalization the generalized suffix array. We present algorithms for constructing a generalized suffix array and for searching the array. Let A denote the set of strings for which we are to build a generalized suffix array. Let N be the sum of the lengths of all strings in A and n the length of the longest string in A. Our sort algorithm needs O(N log n) time in the worst case using O(N) storage to construct the generalized suffix array and the information about the longest common prefixes (lcps) between adjacent suffixes in the suffix array which will be required by the search algorithm. Given the suffix array and its lcp information, the search algorithm answers an on-line search query of the type, “Is W a substring of some strings in A? If so, where does it occur within strings of A?” in O(|W| + log N) time in the worst case. The above bounds are independent of the size of the underlying alphabet Σ. We then apply the generalized suffix array to the problem of finding all occurrences of an m×m matrix (the pattern) as a submatrix in a larger n×n matrix (the text). Our solution falls into the class of the 2D pattern matching algorithms that first preprocess the text and then search for the pattern. After preprocessing the text using O(n² log n) time and O(n²) space, our algorithm can find all occurrences of the pattern in the text in expected time sublinear in the size of the pattern. To the best of our knowledge, our algorithm is the average-case fastest algorithm in its class.
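To make the data structure concrete, the following is a minimal Python sketch of a generalized suffix array and a substring query over it. It is an illustration only and not the algorithm from the paper: construction uses a plain comparison sort rather than the O(N log n) sort described above, and the search is an ordinary binary search taking O(|W| log N) character comparisons, without the lcp information that gives the O(|W| + log N) bound. The function names `build_generalized_suffix_array` and `find_occurrences` and the (string index, offset) representation are assumptions made for this example.

```python
from typing import List, Tuple

Entry = Tuple[int, int]  # (index of the string in A, offset of the suffix)


def build_generalized_suffix_array(strings: List[str]) -> List[Entry]:
    """Return every suffix of every string, sorted lexicographically.

    Naive construction (assumption for illustration): each suffix is
    materialized as a sort key, so this is far slower than the paper's
    O(N log n) construction.
    """
    entries = [(i, j) for i, s in enumerate(strings) for j in range(len(s))]
    # Ties between equal suffixes are broken by (string index, offset).
    entries.sort(key=lambda e: (strings[e[0]][e[1]:], e[0], e[1]))
    return entries


def find_occurrences(strings: List[str], gsa: List[Entry], w: str) -> List[Entry]:
    """Return all (string index, offset) pairs at which w occurs.

    Two binary searches locate the contiguous block of suffixes that
    start with w. No lcp information is used here.
    """
    m = len(w)

    def prefix(k: int) -> str:
        i, j = gsa[k]
        return strings[i][j:j + m]

    # First position whose length-m prefix is >= w.
    lo, hi = 0, len(gsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose length-m prefix is > w.
    hi = len(gsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid

    return sorted(gsa[start:lo])


A = ["banana", "bandana"]
gsa = build_generalized_suffix_array(A)
print(find_occurrences(A, gsa, "ana"))  # → [(0, 1), (0, 3), (1, 4)]
```

Storing each suffix as a (string index, offset) pair is what lets one sorted array answer both parts of the query, whether W occurs and where it occurs within the strings of A.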

Fei Shi
