Predicting and improving the protein sequence alignment quality by support vector regression

Background
Successful protein structure prediction by comparative modeling requires not only a good template protein of known structure but also an accurate sequence alignment between the query protein and the template. Alignment accuracy is known to vary significantly with the choice of alignment parameters such as the gap opening penalty and the gap extension penalty. Because the accuracy of a sequence alignment is typically measured by comparing it with the corresponding structure alignment, it cannot be evaluated directly without the structure of the query protein, which is not available at the time of structure prediction. Moreover, no single alignment parameter option always yields the optimal alignment.

Results
In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub score as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is used as input to the trained SVR model to predict the alignment quality. Performance is evaluated by several measures, including the Pearson correlation coefficient between the observed and predicted MaxSub scores, which is high at 0.945. For each query-template pair, 48 alignments are generated by varying the alignment options. The trained SVR models are then applied to predict the MaxSub scores of these alignments and to select the best alignment option specifically for that query-template pair. This adaptive selection procedure yields a 7.4% improvement in MaxSub scores over using the single best parameter option for all query-template pairs.

Conclusion
The present work demonstrates that alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are unsuitable for structure prediction because of poor alignment accuracy. The method is implemented as part of FORECAST, a fold-recognition server, and is freely available on the web at http://pbil.kaist.ac.kr/forecast
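The adaptive selection step can be illustrated with a minimal Python sketch. It is not the FORECAST implementation: the function names, the pair labels, and the toy scores are all hypothetical, only three alignment options are used instead of 48, and the "single best option" is assumed here to be the one with the highest mean observed score across all pairs. The predicted scores stand in for the output of a trained SVR model.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_adaptive(predicted):
    """For each query-template pair, pick the option with the highest predicted score."""
    return {
        pair: max(range(len(scores)), key=scores.__getitem__)
        for pair, scores in predicted.items()
    }


def best_single_option(observed):
    """Pick the one option with the highest mean observed score over all pairs.

    Assumption: this is how a 'single best parameter option' would be chosen.
    """
    n_options = len(next(iter(observed.values())))
    return max(
        range(n_options),
        key=lambda k: mean(scores[k] for scores in observed.values()),
    )


def mean_observed(observed, choice):
    """Mean observed score when each pair uses the option given in `choice`."""
    return mean(observed[pair][k] for pair, k in choice.items())


# Hypothetical toy data: scores per alignment option for three pairs.
observed = {
    "pairA": [0.50, 0.60, 0.40],
    "pairB": [0.30, 0.20, 0.45],
    "pairC": [0.70, 0.65, 0.60],
}
# Stand-in for SVR predictions of the scores above (made-up values).
predicted = {
    "pairA": [0.48, 0.58, 0.42],
    "pairB": [0.28, 0.25, 0.40],
    "pairC": [0.66, 0.68, 0.61],
}

adaptive_choice = select_adaptive(predicted)
fixed_option = best_single_option(observed)
fixed_choice = {pair: fixed_option for pair in observed}

adaptive_mean = mean_observed(observed, adaptive_choice)
fixed_mean = mean_observed(observed, fixed_choice)

# Correlation between observed and predicted scores over all alignments.
flat_observed = [s for pair in observed for s in observed[pair]]
flat_predicted = [s for pair in observed for s in predicted[pair]]
correlation = pearson(flat_observed, flat_predicted)
```

In this toy data the adaptive choice is imperfect for one pair, yet `adaptive_mean` still exceeds `fixed_mean`, which mirrors the kind of comparison reported in the Results without reproducing its numbers.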
