A multivariate analysis method for discriminating protein secondary structural segments.

Using discriminant analysis, three types of protein secondary structure segments (helices, beta-strands and coils) are discriminated from amino acid sequence information alone. Each variable in the discriminant analysis is defined by the amino acid index used to represent the sequence data and by the calculation method used to extract a feature from this representation. The three types of secondary structure segments, derived from a set of non-homologous proteins in the Protein Data Bank, are analyzed with 888 variables, which correspond to the mean, standard deviation, 3.6-residue periodicity and 2-residue periodicity of the numerical profiles determined from 222 published amino acid indices. These variables are combined to obtain the best discrimination of the three types of segments. When up to three variables are combined, the best discrimination rate is 75%. The selected variables consist of the mean of alpha propensity (or turn propensity), the mean of beta propensity, and the 3.6-residue periodicity of hydrophobicity. This variable selection procedure can also be applied to other types of discrimination problems, once groups of sequence data are properly organized.
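
As an illustration of how one amino acid index yields four variables (so that 222 indices give 222 x 4 = 888), the following minimal sketch computes the four features for a single segment. It rests on assumptions not stated in the abstract: the Kyte-Doolittle hydropathy scale stands in for one of the published indices, periodicity is taken to be the Fourier power of the mean-centred profile at the given period, and the names `segment_features` and `periodicity` are hypothetical.

```python
import cmath
import math
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy scale, used here only as an example amino acid index
# (assumption: the abstract does not say which indices were used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def periodicity(values, period):
    """Fourier power of the mean-centred profile at `period` residues,
    divided by segment length (an assumed definition of periodicity)."""
    centre = mean(values)
    omega = 2 * math.pi / period  # angular step per residue
    total = sum((v - centre) * cmath.exp(1j * omega * k)
                for k, v in enumerate(values))
    return abs(total) ** 2 / len(values)


def segment_features(segment, index=KYTE_DOOLITTLE):
    """Map a segment to its numerical profile and extract four features."""
    values = [index[aa] for aa in segment]
    return {
        "mean": mean(values),
        "sd": pstdev(values),                      # population standard deviation
        "period_3_6": periodicity(values, 3.6),    # helix-like periodicity
        "period_2": periodicity(values, 2.0),      # strand-like periodicity
    }


if __name__ == "__main__":
    # An alternating hydrophobic/polar segment should score higher on the
    # 2-residue periodicity than on the 3.6-residue periodicity.
    print(segment_features("LKLKLKLK"))
```

Repeating this for every index gives the feature vector of a segment; a discriminant analysis would then be fitted on subsets of these features to separate the three segment classes.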