The prediction of protein contacts from multiple sequence alignments.

We have studied the question of how much extra predictive power the correlated mutational behaviour of pairs of amino acid residues separated along a sequence has concerning the likelihood of those residues being in contact in the folded protein. The mutational behaviour is deduced from multiple sequence alignments. Our findings are that there is, indeed, some valuable information available from this source and that it is sufficient to make a significant improvement in our ability to predict contacts, when compared with earlier methods that do not take into account the correlations between the mutations. This improvement is approximately twice as large as can be obtained by the more economical method of simply averaging pair preferences over the same sequence alignment. Even when using a method based on pair preferences, a further significant improvement can be made by penalizing more variable regions (on the reasonable assumption that invariant residues are relatively more likely to be in contact), though we have found no way of improving the pair preference method to the extent that it matches the method based on correlated behaviour. Our new method is thought to be the best data-based method of contact prediction developed so far, achieving, on average, an improvement over a random (i.e. information-free) prediction of a factor of five when the number of contacts predicted is chosen to match the number that actually occur.
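
The closing figure of merit, an improvement over a random prediction when the number of contacts predicted matches the number that actually occur, can be illustrated with a short sketch. This is one plausible reading of that measure, not the procedure used in the study: the function name `improvement_over_random`, the dictionary-of-scores input and the toy data are all assumptions made for illustration. It ranks candidate residue pairs by score, keeps as many as there are true contacts, and divides the resulting accuracy by the accuracy expected from picking pairs at random.

```python
def improvement_over_random(scores, contacts):
    """Ratio of top-N accuracy to random accuracy (hypothetical helper).

    scores   -- dict mapping a residue pair (i, j) to a predicted score,
                where a higher score means "more likely in contact"
    contacts -- set of residue pairs (i, j) observed in contact
    """
    # Only contacts among the scored candidate pairs can be predicted.
    true_pairs = contacts & scores.keys()
    n_true = len(true_pairs)
    if n_true == 0:
        raise ValueError("no true contacts among the scored pairs")

    # Predict exactly as many contacts as actually occur.
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = ranked[:n_true]

    # Fraction of predicted pairs that are real contacts.
    accuracy = sum(1 for pair in predicted if pair in true_pairs) / n_true

    # A random, information-free pick is right with probability equal
    # to the contact density among the candidate pairs.
    random_accuracy = n_true / len(scores)

    return accuracy / random_accuracy


# Toy data (invented): ten candidate pairs, two of them true contacts.
toy_scores = {
    (1, 5): 0.90, (1, 6): 0.80, (2, 7): 0.70, (2, 8): 0.60, (3, 9): 0.50,
    (3, 10): 0.40, (4, 11): 0.30, (4, 12): 0.20, (5, 13): 0.10, (6, 14): 0.05,
}
toy_contacts = {(1, 5), (3, 9)}

ratio = improvement_over_random(toy_scores, toy_contacts)
print(f"improvement over random: {ratio:.1f}")
```

In the toy data one of the two top-ranked pairs is a true contact, giving an accuracy of 0.5 against a random expectation of 0.2. Fixing the number of predictions to the number of true contacts makes the fraction of predictions that are correct equal to the fraction of true contacts recovered, so a single number summarises the result.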