Linear programming optimization and a double statistical filter for protein threading protocols

The design of scoring functions (or potentials) for threading that differentiate native‐like from non‐native structures at limited computational cost is an active field of research. We revisit two widely used families of threading potentials: the pairwise and profile models. To design optimal scoring functions, we use linear programming (LP). The LP protocol makes it possible to measure the difficulty of a particular training set in conjunction with a specific functional form of the scoring function. Gapless threading demonstrates that pair potentials have greater prediction capacity than profile energies. However, alignments with gaps are easier to compute with profile potentials. We therefore search for and propose a new profile model whose prediction capacity is comparable to that of contact potentials. A protocol that uses LP to determine optimal energy parameters for gaps is also presented. A statistical test, based on a combination of local and global Z‐scores, is employed to filter out false positives. Extensive tests of the new protocol are presented. The new model provides an efficient alternative to threading with pair energies while maintaining comparable accuracy. The code, databases, and a prediction server are available at http://www.tc.cornell.edu/CBIO/loopp. Proteins 2001;45:241–261. © 2001 Wiley‐Liss, Inc.
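
The double statistical filter mentioned above can be illustrated with a minimal sketch. The abstract does not specify how the local and global Z‐scores are computed or combined, so everything here is an assumption: the function names, the use of shuffled‐sequence energies as the background sample, the convention that lower energy is better, and the two threshold values are all hypothetical.

```python
from statistics import mean, pstdev


def z_score(score, background):
    """Z-score of an alignment energy against a background sample.

    Assumes lower energy is better, so a strongly negative score
    relative to the background yields a large positive Z.
    """
    mu = mean(background)
    sigma = pstdev(background)  # population standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (mu - score) / sigma


def passes_double_filter(global_score, global_background,
                         local_score, local_background,
                         z_global_min=3.0, z_local_min=2.0):
    """Accept a hit only if BOTH Z-scores clear their thresholds.

    The thresholds are placeholder values, not taken from the paper.
    Returns (accepted, z_global, z_local).
    """
    zg = z_score(global_score, global_background)
    zl = z_score(local_score, local_background)
    accepted = zg >= z_global_min and zl >= z_local_min
    return accepted, zg, zl


if __name__ == "__main__":
    # Toy numbers: energies of the query alignment versus energies
    # obtained for shuffled versions of the query (hypothetical data).
    accepted, zg, zl = passes_double_filter(
        global_score=-8.0, global_background=[-2.0, 2.0],
        local_score=-3.0, local_background=[-1.0, 1.0],
    )
    print(accepted, zg, zl)
```

Requiring both scores to pass is one simple way to combine two tests: a hit that looks significant under only one alignment mode is rejected, which trades some sensitivity for fewer false positives.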
