Bridging the gap between sequence and function.

The aim of this meeting was to explore new ideas and approaches in computational biology at the interface between the genomic and post-genomic eras. The organizers intended the meeting to be a mix of reports on advances in more or less traditional computational biology subjects and new studies that strive to provide more direct links between computation and biological function. The traditional themes include gene recognition, sequence similarity searches, connections between protein sequence, structure and function, and the organization of different types of information into value-added databases. Perhaps the most notable advances were reported in the latter area.

Minoru Kanehisa (Kyoto University) described the current status of the KEGG (Kyoto Encyclopedia of Genes and Genomes) database (Ref. 1). The remarkable feature of KEGG is that it is not just a database but a research tool that allows one to compute correlations between clusters of proteins produced on the basis of different criteria (e.g. sequence families and protein sets unified by function), which, on many occasions, can result in valuable new insights. Chris Ponting (National Center for Biotechnology Information, NIH, Bethesda, USA, and Oxford University, UK) described the SMART (Simple Modular Architecture Research Tool) search engine (Ref. 2). SMART is an example of a new generation of sequence analysis tools that are not only powerful in terms of sensitivity but also synthesize information at a higher level, providing the user with a real annotation of a protein sequence rather than just a set of likely homologs. Peer Bork (EMBL, Heidelberg) described a number of computational analyses that explicitly take advantage of different types of information, such as patterns of phylogenetic conservation, operon organization and gene expression level, to predict gene function. Perhaps it would be fair to say that the current emphasis in computational biology is not so much on new analytical methods but, rather, on synthetic approaches that allow a fresh view of the data (Ref. 3).

Several talks concentrated on the use of computational approaches to attack long-standing, fundamental problems in biology. Alexey Kondrashov (National Center for Biotechnology Information, NIH, Bethesda, USA) discussed the use of genome sequence data in his quest for the Holy Grail of evolutionary genetics: determining the number of spontaneous deleterious mutations per genome per generation. This critical parameter (U) is given by the simple relationship U = T × F, where T is the total number of mutations per genome per generation and F is the fraction of the genome that is subject to stabilizing selection. T can be estimated readily from the number of differences between the sequences of pseudogenes in closely related species, and a value of T > 100 for humans appears to be reliable. Estimating F, however, is a major problem.
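To make the arithmetic explicit, the minimal sketch below simply tabulates U = T × F using the human estimate of T > 100 over a range of values of F. The particular F values (other than the nematode-derived estimate discussed next) are arbitrary assumptions used only for illustration, and the sketch is not part of Kondrashov's analysis.

```python
# Illustrative sketch of the relationship U = T * F discussed above.
# T: total number of new mutations per genome per generation
#    (estimated from pseudogene divergence between closely related species).
# F: fraction of the genome under stabilizing selection.
# The F values below are assumptions chosen for illustration only.

def deleterious_mutations_per_generation(T: float, F: float) -> float:
    """Expected number of new deleterious mutations per genome per generation."""
    return T * F

T = 100.0  # conservative lower bound for humans cited in the text
for F in (0.05, 0.10, 0.20, 0.30):
    U = deleterious_mutations_per_generation(T, F)
    print(f"F = {F:.2f}  ->  U = {U:.0f}")
```

With T fixed at roughly 100, U is set almost entirely by F, which is why the estimate of F described next is the crux of the argument.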
Kondrashov and co-workers approached this problem by comparing intergenic regions from the genomes of two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae, and arrived at a value of F ∼ 0.2 (Ref. 4). In other words, it appears that about 20% of the genome is under stabilizing selection. Should this fraction be transferable to humans, it would translate into U > 10. The implications for human population and medical genetics are clear and should not be underestimated. No doubt, much more data will be needed before we can be confident of these estimates. Nevertheless, this work is notable in that it highlights the impact that genome data, combined with ingenious computational and theoretical approaches, might have, both on our understanding of fundamental problems in biology and on very practical issues related to the future of the human race.

Laura Landweber (Princeton University, USA) described the work of her group aimed at resolving the old dilemma of the nature and origin of the genetic code: frozen accident or chemical necessity (Ref. 5). It appears that this old problem might finally be yielding, and the answer is somewhat unexpected in that direct interactions between amino acids and codons might have been critical in establishing the code.

Towards the end of the meeting, there seemed to be a degree of consensus about at least one important direction in the evolution of computational biology itself. As the field comes of age, the emphasis is shifting towards the biology, and the discipline is likely to become one of the hot biological fields of the post-genomic era.

References
1. Ogata, H. et al. (1999) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34
2. Ponting, C.P. et al. (1999) SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 27, 229–232
3. Bork, P. and Koonin, E.V. (1998) Predicting functions from protein sequences – where are the bottlenecks? Nat. Genet. 18, 313–318
4. Shabalina, S.A. and Kondrashov, A.S. (1999) Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet. Res. 74, 23–30
5. Knight, R.D. et al. (1999) Selection, history and chemistry: the three faces of the genetic code. Trends Biochem. Sci. 24, 241–247