Paper information - A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation

We propose a fast and memory-efficient algorithm for sorting the suffixes of a text in lexicographic order. Suffix sorting matters because the array of suffix indexes in sorted order, called a suffix array, is a memory-efficient alternative to the suffix tree. Suffix sorting is also used in the Burrows-Wheeler transformation (see Technical Report 124, Digital SRC Research Report, 1994) for block-sorting text compression, so fast sorting algorithms are desirable. We compare the suffix array construction algorithms of Bentley-Sedgewick (see Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, p.360-9, 1997), Andersson-Nilsson (see 35th Symp. on Foundations of Computer Science, p.714-21, 1994) and Karp-Miller-Rosenberg (1972), together with the suffix tree construction algorithm of Larsson (see Data Compression Conference, p.190-9, 1996), in terms of speed and memory requirements, and propose a new algorithm that combines them to be both fast and memory efficient. We also define a measure of the difficulty of sorting suffixes: the average match length. Our algorithm is effective when the average match length of a text is large, especially for large databases.
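To make the terms in the abstract concrete, here is a minimal Python sketch of a suffix array, the Burrows-Wheeler transformation derived from it, and an average match length. It is not the algorithm proposed in the paper: the sort is a plain prefix-doubling scheme in the spirit of Karp-Miller-Rosenberg, chosen for brevity. The function names, the sentinel character, and the reading of "average match length" as the mean longest common prefix between neighbouring suffixes in sorted order are all assumptions made for illustration; the abstract does not give the formula.

```python
def suffix_array(text):
    """Return the start indexes of all suffixes of text in lexicographic order.

    Prefix doubling: after each round, suffixes are ranked by their first
    2*k characters, using the ranks computed for the first k characters.
    """
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # rank by the first character
    sa = list(range(n))
    k = 1
    while True:
        # Sort key: (rank of first k chars, rank of the next k chars);
        # -1 stands for "past the end" and sorts before any real rank.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transformation of text + sentinel.

    Assumes the sentinel does not occur in text and sorts before
    every character of text.
    """
    s = text + sentinel
    # The character preceding each sorted suffix (wrapping around at index 0).
    return "".join(s[i - 1] for i in suffix_array(s))


def average_match_length(text):
    """Mean longest-common-prefix length of adjacent sorted suffixes
    (an assumed reading of the measure named in the abstract)."""
    n = len(text)
    if n < 2:
        return 0.0
    sa = suffix_array(text)
    total = 0
    for a, b in zip(sa, sa[1:]):
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        total += length
    return total / (n - 1)


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(bwt("banana", sentinel="$"))
    print(average_match_length("banana"))
```

Under this reading, a highly repetitive text has a large average match length, so comparison-based sorting must inspect many characters per comparison. That is the regime in which the abstract says the proposed algorithm is effective.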

Kunihiko Sadakane

[1] Robert Sedgewick, et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[2] Edward M. McCreight. A Space-Economical Suffix Tree Construction Algorithm, 1976, JACM.

[3] N. Jesper Larsson. Extended application of suffix trees to data compression, 1996, Proceedings of Data Compression Conference - DCC '96.

[4] Ian H. Witten, et al. Data Compression Using Adaptive Coding and Partial String Matching, 1984, IEEE Trans. Commun.

[5] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[6] Timothy C. Bell, et al. A corpus for the evaluation of lossless compression algorithms, 1997, Proceedings DCC '97. Data Compression Conference.

[7] Arnold L. Rosenberg, et al. Rapid identification of repeated patterns in strings, trees and arrays, 1972, STOC.

[8] D. J. Wheeler, et al. A Block-sorting Lossless Data Compression Algorithm, 1994.

[9] Arne Andersson, et al. A new efficient radix sort, 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.