Serial and parallel methods for I/O-efficient suffix tree construction

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its applicability has been limited because suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially in emerging applications involving text, time series, and biological sequence data. To benefit from these advances, a scalable suffix tree construction algorithm is imperative. Several disk-based suffix tree construction algorithms have emerged in the past few years to address this challenge. However, construction times remain daunting: for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, we first demonstrate empirically, and argue, that all existing suffix tree construction algorithms share a severe limitation: to achieve reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation stems from the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. Second, to address this limitation, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and a fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference, coupled with effective collective communication, facilitates an efficient parallelization on massively parallel systems such as the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire human genome on a 1024-processor system in under 15 minutes.
