Exploring the limits of nearest neighbour secondary structure prediction.

This paper presents a simple and robust secondary structure prediction scheme (SIMPA96) based on an updated version of the nearest neighbour method. Using a larger database of known structures, the BLOSUM62 substitution matrix and a regularization algorithm, the three-state prediction accuracy is increased by 4.7 percentage points to 67.7% for a single sequence, and reaches up to 72.8% when multiple alignments are used. The increase in prediction accuracy with respect to the previous version can be ascribed almost entirely to the sevenfold increase in the size of the database. A more detailed analysis of the results shows that badly predicted regions of a protein sequence are randomly distributed throughout the database, and that the goal of perfect secondary structure prediction by methods that use only local sequence information is illusory.
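
To make the general idea concrete, the following is a minimal sketch of a nearest neighbour predictor and of the three-state accuracy measure; it is not the SIMPA96 implementation. Everything in it is an assumption made for illustration: the function names, the window half-width `half`, the neighbour count `k`, the default coil state at the sequence ends, and in particular `toy_score`, an identity score standing in for a real BLOSUM62 lookup. The regularization step and the use of multiple alignments are not modelled.

```python
from collections import Counter


def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup such as BLOSUM62:
    # 1 for identical residues, 0 otherwise.
    return 1 if a == b else 0


def window_similarity(w1, w2):
    # Sum of residue-by-residue scores over two equal-length windows.
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def reference_windows(database, half):
    # Collect every full-width window in the database together with the
    # known state (H, E or C) of its central residue.
    ref = []
    for seq, states in database:
        for i in range(half, len(seq) - half):
            ref.append((seq[i - half:i + half + 1], states[i]))
    return ref


def predict(query, database, half=2, k=3):
    # For each query window, take the k most similar database windows and
    # assign the majority state of their central residues.
    ref = reference_windows(database, half)
    out = []
    for i in range(len(query)):
        if i < half or i >= len(query) - half:
            out.append("C")  # assumption: ends without a full window default to coil
            continue
        w = query[i - half:i + half + 1]
        nearest = sorted(ref, key=lambda r: window_similarity(w, r[0]), reverse=True)[:k]
        votes = Counter(state for _, state in nearest)
        out.append(votes.most_common(1)[0][0])
    return "".join(out)


def q3(predicted, observed):
    # Three-state accuracy: fraction of residues whose predicted state
    # matches the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

With a toy database such as `[("AAAAAAA", "HHHHHHH"), ("VVVVVVV", "EEEEEEE")]`, calling `predict("AAAAAAA", database)` assigns helix to the positions that have a full window. Because each prediction depends only on a short local window, a sketch of this kind also illustrates the restriction to local sequence information discussed above.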