Paper information - Suffix trees for very large inputs

Suffix trees for very large inputs

A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to real-life applications, the current methods for constructing suffix trees do not scale to large inputs. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first half of this work focuses on designing a practical algorithm that avoids massive random access to the trees being built. This effort resulted in a new algorithm, DiGeST, which improves significantly on previous work by reducing random access to the suffix tree and performing only two passes over disk data. As a result, this algorithm scales to larger genomic data than was managed before. All existing practical algorithms perform random access to the input string, which in essence requires that the input be small enough to be kept in main memory. The ever-increasing amount of genomic data, however, requires the ability to build suffix trees for much larger strings. In the second half of this work we present another suffix tree construction algorithm, B2ST, that is able to construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to both of them. As a proof of concept, we show that B2ST makes it possible to build a suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the human genome. The construction of suffix trees for inputs of such magnitude has never been reported before. Finally, we show that, after the off-line suffix tree construction is complete, search queries on entire sequenced genomes can be performed very efficiently. This high query performance is achieved thanks to a special disk layout of the suffix trees produced by our algorithms.
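To make the role of the data structure concrete, here is a minimal sketch of substring search over a naive, uncompressed suffix trie held entirely in memory. It is not DiGeST or B2ST and says nothing about their disk layout; the class name `SuffixTrie` and the method `occurrences` are hypothetical names chosen for this illustration. The quadratic space it uses on long inputs is one way to see why construction at genome scale needs the external-memory techniques the abstract describes.

```python
class SuffixTrie:
    """Naive, uncompressed suffix trie (illustration only, not DiGeST/B2ST).

    Every suffix of the text is inserted character by character, so the
    structure can take O(n^2) space. A real suffix tree compresses
    single-child paths to reach O(n) nodes.
    """

    def __init__(self, text):
        self.text = text
        self.root = {}
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
            # The key None cannot clash with a character, so it safely
            # marks the end of the suffix that begins at `start`.
            node[None] = start

    def occurrences(self, pattern):
        """Return the sorted start positions of `pattern` in the text."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return []
            node = node[ch]
        # Every suffix end marker below this node is one occurrence.
        found = []
        stack = [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is None:
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)
```

For example, after `trie = SuffixTrie("GATTACA")`, the call `trie.occurrences("TA")` walks two edges from the root and then collects the suffix start positions stored below that node. The walk takes time proportional to the pattern length regardless of the text length, which is the property that makes suffix trees attractive for querying whole genomes once they are built.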

Marina Barsky
