Search-Optimized Suffix-Tree Storage for Biological Applications

Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences. Further, Stellar supports both the standard root-to-leaf lookup queries and sophisticated sequence-search algorithms that exploit the suffix-links of suffix-trees. Our results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines.
