Whole-genome shotgun assembly and comparison of human genome assemblies

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860–921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.
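The two coverage figures quoted above (5.3-fold for reads, 39-fold for inserts) follow from simple arithmetic, sketched below. This is only an illustration, not the authors' method: the function names are hypothetical, and the genome size, read length, and library mix passed in the usage lines are placeholder assumptions rather than values from the paper.

```python
def sequence_coverage(num_reads, mean_read_len_bp, genome_size_bp):
    """Fold coverage by reads: total sequenced bases divided by genome size."""
    return num_reads * mean_read_len_bp / genome_size_bp


def clone_coverage(pairs_by_insert_bp, genome_size_bp):
    """Fold coverage by inserts: total bases spanned by paired-end clones
    divided by genome size.

    pairs_by_insert_bp maps insert length (bp) -> number of read pairs.
    """
    spanned = sum(insert_bp * n_pairs
                  for insert_bp, n_pairs in pairs_by_insert_bp.items())
    return spanned / genome_size_bp


if __name__ == "__main__":
    # ASSUMPTION: placeholder genome size and read length, chosen for round
    # numbers; neither value is taken from the paper.
    print(sequence_coverage(27_000_000, 500, 2_700_000_000))

    # ASSUMPTION: toy library mix on a toy 1-Mbp genome, purely illustrative.
    print(clone_coverage({2_000: 100, 10_000: 50}, 1_000_000))
```

Clone coverage can far exceed sequence coverage because each pair of short end reads spans an insert of 2 to 50 kbp, and it is this spanning information that constrains the order and orientation of contigs within scaffolds.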
