Gap statistics for whole genome shotgun DNA sequencing projects

MOTIVATION: Investigators rely on gap estimates for DNA sequencing projects. Standard theories assume that sequences are independently and identically distributed, which leads to appreciable under-prediction of gaps.

RESULTS: Using a statistical scaling factor and data from 20 representative whole genome shotgun projects, we construct regression equations that relate coverage to a normalized gap measure. For prokaryotic genomes, the gap measure does not correlate with sequence coverage, whereas eukaryotes show strong correlation if the chaff is ignored. Gaps decrease at an exponential rate that is only about one-third of the rate predicted by theory alone. Case studies suggest that departure from theory can largely be attributed to assembly difficulties for repeat-rich genomes, but bias and coverage anomalies are also important when repeats are sparse. Such factors cannot be readily characterized a priori, which suggests upper limits on the accuracy of gap prediction. We also find that the diminishing coverage probability discussed in other studies is a theoretical artifact that does not arise in the typical project.
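
To make the "one-third" observation concrete, the following is a minimal sketch and not the regression model of this study. It assumes the classical Lander-Waterman style estimate, in which N reads of length L on a genome of length G give coverage c = NL/G and an expected gap count of roughly N exp(-c sigma), with sigma = 1 - T/L for a minimum detectable overlap T. The function names and the `rate_factor` parameter are hypothetical, introduced only to illustrate a slower exponential decay; the study's actual scaling factor and regression coefficients are not reproduced here.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """Sequence coverage c = N * L / G."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads, read_len, genome_len, min_overlap=0, rate_factor=1.0):
    """Expected gap count under an i.i.d. (Lander-Waterman style) model.

    rate_factor is a hypothetical knob: 1.0 gives the textbook decay,
    while a value near 1/3 mimics the slower empirical decay that the
    abstract reports. It is an illustrative assumption only.
    """
    c = coverage(n_reads, read_len, genome_len)
    sigma = 1.0 - min_overlap / read_len  # effective fraction of a read
    return n_reads * math.exp(-rate_factor * c * sigma)


if __name__ == "__main__":
    # Illustrative project: 2 Mb genome, 500 bp reads, 8x coverage.
    n, length, genome = 32_000, 500, 2_000_000
    theory = expected_gaps(n, length, genome)
    slowed = expected_gaps(n, length, genome, rate_factor=1.0 / 3.0)
    print(f"coverage      : {coverage(n, length, genome):.1f}x")
    print(f"theory alone  : {theory:.1f} gaps")
    print(f"one-third rate: {slowed:.1f} gaps")
```

Under these assumptions the slower decay leaves far more gaps at the same coverage, which is the sense in which standard theory under-predicts gaps.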
