What we do not know about sequence analysis and sequence databases.

The marriage of high-throughput nucleotide sequencing with computational methods for the analysis of nucleotide and protein sequences has ushered in a new era of molecular biology. Entire genomes are deposited into the sequence databases at a growing rate. Typically, investigators can use computational sequence analysis to assign functions to the majority of the open reading frames in a genome sequence; such analysis can identify a surprisingly large fraction of the genes within an organism, and that fraction is increasing over time as the sequence databases come to contain a larger fraction of all functional domains. The growing wealth of information within the sequence databases provides a foundation for the biology of the 21st century. We will mine these data for decades to come, developing complex and highly accurate cellular models that can predict the behavior of living systems by integrating across the functions of their molecular parts.

Or will we? Although the preceding scenario is the likely one, we would be irresponsible not to consider another possible outcome: an explosion of incorrect annotations within the sequence databases. Each new sequence deposited in the public databases has been annotated with respect to those same databases. Functional annotations are propagated repeatedly from one sequence to the next, and then to the next, with no record made of the source of a given annotation, leading to a potential transitive catastrophe of erroneous annotations. Investigators who later attempt to separate the wheat from the chaff will discover that they cannot simply retreat to the safety of experimentally annotated sequences by ignoring the computationally annotated ones, because the public databases do not explicitly distinguish the two sets. In fact, the public sequence databases keep virtually no tracking information about the methods used to annotate their data.

Can we rule this possibility out on any objective grounds? No. We have no reliable data on either the current rate of errors (incorrect functional annotations) within the public databases or the rate of change of that error rate (we do not even know whether it is increasing or decreasing from year to year).

Many years of research have led to the development of detailed statistical models for sequence-similarity searching algorithms such as FASTA and the BLAST family of programs. Researchers employ these algorithms to identify the functions of novel sequences in two phases. In phase I, they identify homologs of a novel sequence. In phase II, they infer the function of the novel sequence with …