Protein sequence-structure compatibility criteria in terms of statistical hypothesis testing.

The assignment of query protein sequences to probable folds in a threading approach is based on the statistical analysis (learning) of structural properties of amino acids in known protein structures. We formalize the recognition problem in terms of mathematical statistics, namely statistical hypothesis testing. Our general formulation leads to various mathematical forms of a decision rule function for evaluating the quality of a sequence-structure fit. Three criteria were derived according to a likelihood ratio approach. Two of them have new functional forms, while the third happens to coincide with the mean force potential function previously derived under the additional assumption of the Boltzmann law. The new decision rule functions employ (i) the Parzen estimator of a probability density and (ii) a newly introduced non-parametric statistic with known asymptotic distribution. We compared the efficiency of the criteria in a 'structure seeks sequence' search for three highly populated template folds through a query library of non-homologous sequences of proteins with known 3D structure, using residue accessibility as the environmental variable. The criteria reflect different underlying statistical propositions and thus often recognize different correct sequence-structure matches. On the other hand, if an amino acid sequence is recognized as compatible with a template by each of the three decision rules, a more reliable inference of the sequence-structure relationship appears possible, since almost all false positives obtained by the three criteria differ.
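
To make the likelihood ratio idea concrete, the following is a minimal sketch, not the authors' actual criterion. It assumes a Gaussian kernel for the Parzen estimator, a fixed bandwidth, and a score formed by summing per-position log ratios of a residue-specific accessibility density to a pooled background density. The function names, the bandwidth value, and the toy training data are all hypothetical and chosen only for illustration.

```python
import math


def parzen_density(x, samples, bandwidth):
    """Gaussian-kernel Parzen estimate of a 1-D density at point x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def log_likelihood_ratio(sequence, accessibilities, training, bandwidth=0.1):
    """Sum over template positions of log[p(a | residue) / p(a)].

    sequence        -- query residues, one per template position
    accessibilities -- relative accessibility of each template position
    training        -- hypothetical dict: residue -> accessibilities
                       observed for that residue in known structures
    """
    # Background density pools the observations of all residue types.
    background = [a for values in training.values() for a in values]
    score = 0.0
    for residue, a in zip(sequence, accessibilities):
        p_specific = parzen_density(a, training[residue], bandwidth)
        p_background = parzen_density(a, background, bandwidth)
        score += math.log(p_specific / p_background)
    return score


# Toy data (invented): leucine tends to be buried, lysine exposed.
training = {
    "L": [0.05, 0.10, 0.15, 0.08],
    "K": [0.70, 0.80, 0.90, 0.75],
}
template = [0.10, 0.80]  # one buried and one exposed position

good_fit = log_likelihood_ratio("LK", template, training)
poor_fit = log_likelihood_ratio("KL", template, training)
```

Under these assumptions a positive score favors the hypothesis that the sequence fits the template environment, and a negative score favors the background hypothesis; in the toy data `good_fit` is positive and `poor_fit` is negative. A decision threshold, and the agreement of several such criteria, would have to be calibrated on real data.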