ARACHNE: a whole-genome shotgun assembler.

We describe ARACHNE, a new computer system for assembling genome sequence from paired-end whole-genome shotgun reads. ARACHNE has several key features: an efficient and sensitive procedure for finding read overlaps; a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly; read merger based on forward-reverse links; and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing approximately 10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into an even smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded approximately 98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. Assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically deletions of approximately 1 kb), with a very small number of other misassemblies. Assembly was also rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 GB of memory.
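The N50 figures quoted above summarize contiguity. As a rough illustration only, the sketch below computes N50 under one common definition: the largest length L such that contigs of length at least L together contain at least half of the total assembled bases. The abstract does not state which N50 definition ARACHNE's evaluation used (for example, whether it is measured against assembly size or genome size), so the function name `n50` and this definition are assumptions, not the authors' procedure.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum
    of all lengths.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in kb, chosen only for illustration.
example_contigs_kb = [80, 70, 50, 40, 30, 20]
print(n50(example_contigs_kb))
```

The same function applies to supercontig lengths; only the input list changes.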
