Assembling genomes on large-scale parallel computers

Assembling large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. Distinguishing features of our approach include space-efficient, on-demand algorithms that consume only linear space, and strategies that reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. The framework was developed as part of the ongoing maize genome sequencing effort, and we applied it to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, totaling over 1.25 billion nucleotides, into genomic islands in under 2 hours on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole-genome shotgun sequencing and for the assembly of environmental sequences.
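To make the idea of avoiding pairwise alignments concrete, the following is a minimal serial sketch of one common way to do it: fragments are grouped with a union-find structure, and any pair already known to lie in the same cluster is skipped. This is an illustration only and is not taken from the framework described above. The names `DisjointSet`, `overlaps`, and `cluster_fragments` are hypothetical, and the exact suffix-prefix match in `overlaps` is a toy stand-in for a real alignment routine.

```python
from itertools import combinations


class DisjointSet:
    """Union-find over fragment indices (hypothetical helper)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def overlaps(a, b, min_len=4):
    """Toy stand-in for alignment: exact suffix-prefix match of >= min_len."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), min_len - 1, -1):
            if x[-k:] == y[:k]:
                return True
    return False


def cluster_fragments(fragments, min_len=4):
    """Group fragments into clusters, counting how many overlap tests ran."""
    ds = DisjointSet(len(fragments))
    alignments = 0
    for i, j in combinations(range(len(fragments)), 2):
        # Skip the expensive test when both fragments are already clustered together.
        if ds.find(i) == ds.find(j):
            continue
        alignments += 1
        if overlaps(fragments[i], fragments[j], min_len):
            ds.union(i, j)
    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(ds.find(i), []).append(i)
    return sorted(clusters.values()), alignments


if __name__ == "__main__":
    frags = ["ACGTACGG", "ACGGTTTA", "GGGGACGT", "CCCCCCCC"]
    print(cluster_fragments(frags))
```

In this toy input, fragments 1 and 2 are never compared directly because both were already merged with fragment 0, so five overlap tests run instead of six. The sketch still enumerates all pairs, so it shows only the skipping step, not the linear-space or parallel aspects claimed for the framework.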
