High-quality draft assemblies of mammalian genomes from massively parallel sequence data

Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.
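The scaffold N50 sizes quoted above are not defined in this abstract. Under the common convention, N50 is the largest length L such that scaffolds of length at least L together contain at least half of the total assembled bases. The sketch below illustrates that convention only; the function name `n50` and the toy lengths are assumptions made for illustration, and this is not code from ALLPATHS-LG.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Illustrative helper, not part of ALLPATHS-LG. Assumes the common
    convention: the largest length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest scaffolds first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, arbitrary units).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

Because N50 is weighted by length, a handful of very long scaffolds can raise it substantially even when many short scaffolds remain, which is why it is usually reported together with accuracy and coverage figures.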
