Analysis of the quality and utility of random shotgun sequencing at low redundancies.

The currently favored approach for sequencing the human genome involves selecting representative large-insert clones (100-200 kb), randomly shearing this DNA to construct shotgun libraries, and then sequencing many different isolates from the library. This method, termed directed random shotgun sequencing, requires highly redundant sequencing to obtain a complete and accurate finished consensus sequence. Recently it has been suggested that a rapidly generated, lower-redundancy sequence might be of use to the scientific community. Low-redundancy sequencing has previously been examined using simulated data sets. Here we use trace data from a number of projects submitted to GenBank to perform reconstruction experiments that mimic low-redundancy sequencing. We examined these low-redundancy sequences for the completeness and quality of the consensus product, for information content, and for usefulness in interspecies comparisons. The data presented here suggest three different sequencing strategies, each with a different utility. (1) Nearly complete sequence data can be obtained by sequencing a random shotgun library at sixfold redundancy; this may therefore represent a good point at which to switch from a random to a directed approach. (2) Sequencing at as little as twofold redundancy is sufficient to find most of the information about exons, EST hits, and putative exon similarity matches. (3) To obtain contiguity of coding regions, sequencing at three- to fourfold redundancy would be appropriate. From these results, we suggest that a useful intermediate product for genome sequencing might be obtained at three- to fourfold redundancy. Such a product would allow a large amount of biologically useful data to be extracted while postponing the majority of the work involved in producing a high-quality consensus sequence.
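As a rough point of reference for the redundancy levels discussed above, the sketch below evaluates the standard idealized Poisson model of random shotgun coverage (often associated with Lander and Waterman). It is not the method used in this study, which relied on reconstruction experiments with real trace data. The 150 kb clone length is taken from the middle of the stated 100-200 kb range, and the 500-base read length and the function names are illustrative assumptions. Real libraries deviate from this model because of cloning bias, variable read quality, and the minimum overlap that assembly requires.

```python
import math


def fraction_covered(redundancy: float) -> float:
    """Expected fraction of bases covered by at least one read.

    Idealized Poisson model: reads start uniformly at random, so the
    chance that a given base is missed by every read is exp(-redundancy).
    """
    return 1.0 - math.exp(-redundancy)


def expected_contigs(clone_length: int, read_length: int, redundancy: float) -> float:
    """Expected number of contigs under the same idealized model.

    Ignores the minimum overlap needed to join two reads, so it is an
    optimistic estimate. clone_length and read_length are in bases.
    """
    n_reads = redundancy * clone_length / read_length
    return n_reads * math.exp(-redundancy)


if __name__ == "__main__":
    CLONE_LENGTH = 150_000  # assumption: midpoint of the 100-200 kb range
    READ_LENGTH = 500       # assumption: illustrative read length
    for c in (2, 3, 4, 6):
        covered = fraction_covered(c)
        contigs = expected_contigs(CLONE_LENGTH, READ_LENGTH, c)
        print(f"{c}x: {covered:.1%} covered, ~{contigs:.0f} contigs expected")
```

Under these assumptions the model predicts that most bases are already sampled at twofold redundancy but are spread over many contigs, while at sixfold redundancy only a handful of gaps remain. This is qualitatively in line with the three strategies described above, although the thresholds in this study come from empirical data rather than from this calculation.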
