Guilt by association: contextual information in genome analysis.

The genome sequences churned out by the “genomic revolution” have challenged both computational and experimental biologists to come up with new methods to decipher the secrets of the encoded proteins. Experimental biologists have largely concentrated on a variety of large-scale methods to assay gene expression and protein–protein interactions (Brown and Botstein 1999; Uetz et al. 2000; Walhout et al. 2000). Computational biologists, in turn, have deeply mined the genomes for evolutionary information in the form of homology between genes (Tatusov et al. 1997; Koonin et al. 2000; Ponting et al. 2000). Over the past few years there has been increasing interest in the kinds of information that exist in the context in which a protein, or a domain thereof, is encoded in the genome (Mushegian and Koonin 1996; Dandekar et al. 1998). Recently, contextual information has been offered as a strong handle on the problem of in silico inference of protein function (Enright et al. 1999; Marcotte et al. 1999a,b; Overbeek et al. 1999; Pellegrini et al. 1999; Huynen et al. 2000). Understanding the scope and limitations of these methods may be critical for experimental biologists seeking to use computational guidelines for large-scale investigations of protein function. Here, we outline the recent advances in this direction and briefly illustrate the new leads they provide in understanding protein function.
Contextual information comes in several overlapping grades, each with a different degree of specificity with regard to a particular protein’s role (Fig. 1). The most general form of contextual information is a phyletic profile, that is, the pattern of occurrence of orthologs of a particular gene in a set of genomes under comparison (Pellegrini et al. 1999; Tatusov et al. 2000). In this setup, the working hypothesis is that genes that functionally interact in a particular pathway or complex share a similar phyletic profile.
This hypothesis is supported by the phyletic distribution of components of the core cellular machinery (the translation, transcription, and replication complexes, which interact very tightly) as well as those of metabolic pathways. For example, most of the proteins with a shared phyletic pattern between the archaea and the eukaryotes are components of one of the many protein complexes that have a role in these three core cellular processes. Thus, the detection of uncharacterized proteins with a similar phyletic profile, such as the family typified by MJ0586 from Methanococcus jannaschii, would implicate them in one of the core functions (Fig. 1). When the information from sequence homology is applied to such proteins, one can often arrive at rather precise functional predictions. In the case of the MJ0586-like proteins, sequence comparisons reveal that they have a DNA-binding helix-turn-helix domain (Aravind and Koonin 1999), suggesting that they are components of the basal transcription machinery similar to TFIIB or TBP, which share the same phyletic profile (Fig. 1). This inference is compatible with the recent implication of MBF1, the eukaryotic representative of this family, in transcriptional regulation (Kabe et al. 1999). When a rare shared phyletic pattern is seen for certain proteins whose sequence affinities suggest a related function, a strong case can be made for their interaction. One example is the SET domain, a typically eukaryotic chromatin protein methyltransferase, which is seen in the bacteria Chlamydiae and Bordetella pertussis, along with another eukaryotic chromatin protein domain, SWIB (Stephens et al. 1998). The rare phyletic profile of these proteins in bacteria suggests that the SET and SWIB domains probably interact not only in these bacteria but possibly also in other organisms.
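The phyletic-profile comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome labels, gene names, and presence/absence values are invented, and Hamming distance with a fixed cutoff is one simple similarity measure chosen for the sketch, not necessarily the one used in the cited studies.

```python
# Hypothetical toy data: 1 = an ortholog is present in that genome, 0 = absent.
# Genome labels, gene names, and all values are invented for illustration only.
GENOMES = ["archaeon_A", "archaeon_B", "eukaryote_A",
           "eukaryote_B", "bacterium_A", "bacterium_B"]

PROFILES = {
    "query_gene":   (1, 1, 1, 1, 0, 0),
    "known_gene_1": (1, 1, 1, 1, 0, 0),
    "known_gene_2": (1, 1, 1, 0, 0, 0),
    "known_gene_3": (0, 0, 0, 0, 1, 1),
}


def hamming_distance(a, b):
    """Count the genomes in which two presence/absence profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(a, b))


def similar_profiles(query, profiles, max_distance=1):
    """Return (gene, distance) pairs whose profile is close to the query's.

    max_distance is an arbitrary cutoff assumed for this sketch.
    """
    target = profiles[query]
    hits = [
        (name, hamming_distance(target, profile))
        for name, profile in profiles.items()
        if name != query
    ]
    hits = [(name, dist) for name, dist in hits if dist <= max_distance]
    # Closest profiles first; ties broken alphabetically for a stable order.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    for gene, dist in similar_profiles("query_gene", PROFILES):
        print(gene, dist)
```

Genes returned by such a comparison are candidates for functional linkage only. As the text notes, sequence homology is then needed to narrow the prediction to a specific function.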
Since the pioneering work of Jacob and Monod, scientists have realized that functionally linked genes are coregulated and occur in proximity to each other on the chromosome. Genome comparisons have supported this and show that, in prokaryotes, functionally interacting genes occur in clusters that range from the giant ribosomal operons to gene pairs that survive over large phylogenetic distances (Dandekar et al. 1998; Overbeek et al. 1999). Thus, the occurrence of an uncharacterized gene in the neighborhood (the same operon) of genes with known functions could potentially betray its function. However, the variability of operons and the inability to predict them with a high level of certainty cause a loss of specificity in this form of contextual information. On this issue, Huynen et al. argue that greater stringency in the criteria for gene neighborhood reduces false positives in these inferences, with physical interactions between gene products strongly predicted by conservation of gene order in an operon. Furthermore, as the number of gene neighborhood combinations is far from exhausted with the currently available genomes, this method is likely to improve in its scope and confidence as more genomes become available.
E-MAIL aravind@ncbi.nlm.nih.gov; FAX (301) 480-9241. Insight/Outlook
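The gene-neighborhood criterion discussed above can be illustrated with a minimal sketch that counts how many genomes keep two ortholog groups directly adjacent. Everything here is an assumption made for illustration: the gene orders are invented, adjacency is treated as unordered and strand-blind, chromosomes are treated as linear, and the `min_genomes` threshold merely stands in for the "greater stringency" mentioned in the text. It is not the procedure of Huynen et al.

```python
from collections import defaultdict

# Hypothetical gene orders: each genome is a list of ortholog-group labels
# in chromosomal order. All labels and orders are invented for illustration.
GENE_ORDERS = {
    "genome_1": ["A", "B", "C", "D", "E"],
    "genome_2": ["C", "D", "X", "A", "B"],
    "genome_3": ["B", "A", "Y", "E", "D"],
}


def conserved_pairs(gene_orders, min_genomes=2):
    """Find adjacent gene pairs that are neighbors in at least min_genomes genomes.

    Simplifications assumed for this sketch: adjacency is unordered
    (A-B equals B-A), strand is ignored, and chromosomes are linear.
    """
    support = defaultdict(set)
    for genome, order in gene_orders.items():
        for left, right in zip(order, order[1:]):
            pair = frozenset((left, right))
            if len(pair) == 2:  # skip tandem duplicates of the same group
                support[pair].add(genome)
    return {
        tuple(sorted(pair)): sorted(genomes)
        for pair, genomes in support.items()
        if len(genomes) >= min_genomes
    }


if __name__ == "__main__":
    # Raising min_genomes is a crude stand-in for stricter neighborhood criteria.
    for pair, genomes in sorted(conserved_pairs(GENE_ORDERS, min_genomes=2).items()):
        print(pair, genomes)
```

Raising the threshold trades coverage for specificity: fewer pairs pass, but those that do are conserved across more genomes.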