Genome Assembly Framework on Massively Parallel, Distributed Memory Supercomputers

Genome assembly is the process of assembling a long deoxyribonucleic acid (DNA) sequence from next-generation sequencing (NGS) data. Computational resources can become a bottleneck for large-scale routine use. We propose a genome assembly framework for massively parallel, distributed memory supercomputers. Our framework builds on the simple idea of distributing the reads equally across the processors. Each processor holds the whole reference genome, aligns its short reads independently, and sends the reads back to the root processor together with their corresponding positions, so that the whole genome can be aligned. We run our alignment framework on up to 8,196 processors of the IBM Blue Gene/Q "Avoca" at the Victorian Life Science Computation Initiative. The results show that more than 6 million reads comprising over 324 million nucleotides can be assembled in under 20 minutes, compared with the hours previously required. Thus, our framework allows fast assembly of NGS data.
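
The distribute, align, and gather pattern described above can be illustrated with a minimal Python sketch. It is not the framework's implementation: the processors are simulated by a sequential loop rather than a message-passing library, exact substring search via `str.find` stands in for whatever alignment method the framework uses, and all function names (`partition_reads`, `align_chunk`, `assemble`, `run`) are hypothetical.

```python
from typing import List, Tuple


def partition_reads(reads: List[str], n_procs: int) -> List[List[str]]:
    """Split the reads into n_procs chunks whose sizes differ by at most one."""
    base, extra = divmod(len(reads), n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(reads[start:start + size])
        start += size
    return chunks


def align_chunk(reference: str, chunk: List[str]) -> List[Tuple[int, str]]:
    """Work of one simulated processor: place each read on its copy of the reference.

    Assumption: exact matching only; reads that do not occur are dropped.
    """
    placements = []
    for read in chunk:
        pos = reference.find(read)  # -1 when the read is absent
        if pos >= 0:
            placements.append((pos, read))
    return placements


def assemble(length: int, placements: List[Tuple[int, str]]) -> str:
    """Root step: lay the placed reads onto a blank sequence ('N' = uncovered)."""
    seq = ["N"] * length
    for pos, read in placements:
        seq[pos:pos + len(read)] = read
    return "".join(seq)


def run(reference: str, reads: List[str], n_procs: int) -> str:
    """Simulate the whole pipeline; each loop iteration plays one processor."""
    gathered: List[Tuple[int, str]] = []
    for chunk in partition_reads(reads, n_procs):
        gathered.extend(align_chunk(reference, chunk))  # stands in for a gather at root
    return assemble(len(reference), gathered)
```

Because each chunk is aligned without reference to any other chunk, the loop body in `run` could in principle be executed on separate processors, with communication needed only when the reads are distributed and when the placements are collected at the root.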
