The reader of James Joyce’s Portrait of the Artist as a Young Man (1992) is aided by editor’s notes illuminating the meaning of unfamiliar words: for example, ‘‘greaves in his number’’ simply means ‘‘shinguards in his locker,’’ and ‘‘a cod’’ is a joke or prank. Such minimal explanatory notes are essential, especially for a novice Joyce reader; overly detailed commentary, however, can be misleading and can stifle the reader’s own interpretation of the work.

During the current scale-up phase of human genome sequencing, production groups have been experimenting with various types and levels of annotation. The precedents for biological annotation of sequence records in GenBank, however, come from types of sequences that differ qualitatively and quantitatively from those currently being produced. Historically, the first type of sequence record is that of the ‘‘functionally cloned’’ gene, the end product of what is often years of investigation that began with a particular biological problem in mind. There is usually a one-to-one correspondence between these records and peer-reviewed publications.

The second type of sequence record might be described as the result of a population study, in which many isolates of a particular gene are determined for the purpose of detecting and interpreting variation. Examples include ribosomal genes used to study molecular phylogeny, HIV sequences used to study antigenic variation, and, most recently, copies of human genes from different individuals used to detect sequence polymorphisms for the development of genetic markers. For this second class of sequence data, a multiple alignment is often the most meaningful and appropriate type of annotation, and literature citations to published articles usually accompany these database records as well.

A third major class of GenBank sequences consists of single-pass expressed sequence tags (ESTs). The most important annotations on these records are the source organism, the tissue of origin, and the cloning vector used.
Other features are strictly computed, and there are no publications relating to the specific nature of individual sequences.

None of these traditional forms of annotation is a good model for the high-throughput genomic sequence (HTGS) data now being produced. We will argue that the only annotations essential for HTGS data in a public database archive, apart from identification of the contributing laboratory, are the source organism from which the DNA was obtained and a confidence value, or accuracy assessment, for each base. Virtually everything else is computable on demand and/or quickly becomes obsolete. This is not to say that sequence exegesis is not useful or important; however, there will be many interpretations of the rich literature of the sequences of genomes, and these interpretations will change over time. The resultant ‘‘annotations’’ will benefit from a publication modality that includes some version of the traditional peer-review process for quality assurance.

Two broad categories of annotation have been applied to HTGS data: (1) the results of computations, and (2) the results of experiments. What are some of the disadvantages of applying these types of detailed annotation to the archival reference sequence? It is obvious that the results of sequence similarity searches, particularly matches against ESTs, become out of date almost immediately and can easily and efficiently be recomputed daily, or on demand, by automated systems within an individual’s laboratory or from web-based facilities (Boguski and McEntyre 1994). There is also the problem that Randy Smith has referred to as ‘‘transitive annotation,’’ whereby chains of inference with weak links can lead to misleading or completely erroneous sequence interpretation (Smith 1996). The science of gene prediction based on intrinsic sequence properties is still quite fallible in producing accurate models of complete genes (Burset and Guigó 1996).
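A per-base accuracy assessment of the kind proposed above is commonly expressed as a Phred-style quality score Q, related to the estimated base-calling error probability p by Q = −10 log₁₀(p). A minimal sketch of the conversion (the function name and example values are illustrative, not part of any database standard):

```python
def error_probability(q):
    """Estimated probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

# Q20 corresponds to a 1% error rate, Q30 to 0.1%
for q in (10, 20, 30):
    print(f"Q{q}: p = {error_probability(q):.4f}")
```

Archiving such a score alongside each base lets any downstream user weigh the evidence for a feature without the database committing to an interpretation.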
Application of these methods has already had the insidious effect of populating the protein databases with conceptual translations that are partially right and partially wrong. Experimental data, such as the determination of full-length cDNA sequences for ESTs matching a genomic region, have also been used to annotate HTGS; however, the main disadvantage of any experimental validation of sequence features is added cost and potentially long delays in the submission of ‘‘finished’’ sequence. Even computer-based annotation alone has these effects on the ‘‘bottom line’’ and should be subjected to cost–benefit analysis.

Because we are suggesting continual-update and compute-on-demand approaches to sequence annotation, is it realistic to expect that computational power will be sufficient to the tasks? It is illuminating in this context to look back exactly a decade, to a 1988 article by Charles DeLisi concerned with emerging trends in computational biology (DeLisi 1988). DeLisi [who, by the way, spearheaded the Human Genome Project under the auspices of the Department of Energy (Cook-Deegan 1994)] predicted that anticipated advances in computer speed would be unable to keep up with the growing sequence database and the demand for homology searches.

Corresponding author. E-MAIL boguski@ncbi.nlm.nih.gov; FAX (301) 480-9241. Insight/Outlook
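The compute-on-demand approach we advocate can be caricatured in a few lines of code: cache the result of an expensive computation (say, a similarity search against the current databases) and rerun it only when the cached copy is older than some staleness threshold. The class, callable, and parameter names below are illustrative assumptions, not any real pipeline’s API:

```python
import time

class OnDemandAnnotation:
    """Cache a computed annotation and recompute it only when stale."""

    def __init__(self, compute, max_age=86400):
        self.compute = compute   # e.g. a similarity search against today's database
        self.max_age = max_age   # staleness threshold in seconds (default: one day)
        self._result = None
        self._stamp = None       # time of last recomputation

    def get(self):
        now = time.time()
        if self._stamp is None or now - self._stamp > self.max_age:
            self._result = self.compute()  # never computed, or stale: rerun
            self._stamp = now
        return self._result

# A stand-in for an expensive search; real use would query current databases.
searches = []
annot = OnDemandAnnotation(lambda: searches.append(1) or len(searches), max_age=3600)
annot.get()  # computed fresh
annot.get()  # served from cache; `searches` still holds a single entry
```

Annotations derived this way (EST matches, gene models) stay current with the databases of the day instead of being frozen into the archival record.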
[1] M. Boguski and J. McEntyre. I think therefore I publish. Trends in Biochemical Sciences, 1994.
[2] R. Cook-Deegan. The Gene Wars: Science, Politics, and the Human Genome. 1994.
[3] M. Burset and R. Guigó. Evaluation of gene structure prediction programs. Genomics, 1996.
[4] M. Boguski. The turning point in genome research. Trends in Biochemical Sciences, 1995.
[5] C. DeLisi. Computers in molecular biology: current applications and emerging trends. Science, 1988.
[6] T. Pawson et al. Mammalian SH2-Containing Protein Tyrosine Phosphatases. Cell, 1996.
[7] R. F. Smith. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Research, 1996.