A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences

Since the release of the human genome sequences, indexing them has been one of the most important research issues, and the suffix tree is the most widely adopted index structure for that purpose. Traditional suffix tree construction algorithms suffer severe performance degradation because of the memory bottleneck. Recent disk-based algorithms offer only limited improvement because they incur random disk accesses; moreover, they do not fully utilize modern multi-core CPUs. In this paper, we propose a fast algorithm for indexing the human genome sequences based on a divide-and-conquer strategy. Our algorithm nearly eliminates random disk accesses by accessing the disk in units of contiguous chunks. It also fully utilizes multi-core CPUs by dividing the genome sequences into multiple partitions and assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms DIGEST, the fastest previous algorithm, by a factor of up to 10.5.
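The partition-then-parallelize idea can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: for brevity it builds a sorted list of suffix positions (a suffix array) rather than a suffix tree, keeps everything in memory, and sorts each partition naively. The function names and the `prefix_len` and `workers` parameters are hypothetical, chosen only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_suffixes(text, prefix_len):
    """Divide: group suffix start positions by their leading characters."""
    buckets = {}
    for i in range(len(text)):
        buckets.setdefault(text[i:i + prefix_len], []).append(i)
    return buckets


def sort_partition(text, positions):
    """Conquer: order one partition's suffixes (naive comparison sort)."""
    return sorted(positions, key=lambda i: text[i:])


def build_suffix_array(text, prefix_len=1, workers=4):
    """Sort each partition independently, then concatenate in prefix order."""
    buckets = partition_suffixes(text, prefix_len)
    keys = sorted(buckets)
    # Assumption: a thread pool keeps the sketch simple. Real CPU parallelism
    # in CPython would need a process pool and a picklable worker function.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda k: sort_partition(text, buckets[k]), keys))
    result = []
    for part in parts:
        result.extend(part)
    return result


if __name__ == "__main__":
    print(build_suffix_array("banana"))
```

Because every suffix in one partition shares a prefix that sorts before or after the prefix of every other partition, the partitions can be processed independently and concatenated without a merge step. That independence is what lets each partition be handed to a separate core.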
