STRIKE: evaluation of protein MSAs using a single 3D structure

Motivation: Evaluating alternative multiple protein sequence alignments is an important unsolved problem in biology. The most accurate way of doing this is to use structural information. Unfortunately, most methods require at least two structures to be embedded in the alignment, a condition rarely met when dealing with standard datasets.

Results: We developed STRIKE, a method that determines the relative accuracy of two alternative alignments of the same sequences using a single structure. We validated our methodology on three commonly used reference datasets (BAliBASE, HOMSTRAD and PREFAB). Given two alignments, STRIKE identifies the more accurate one in 70% of cases on average. This figure increases to 79% on very challenging datasets such as the RV11 category of BAliBASE. This discrimination capacity is significantly higher than that reported for other metrics such as Contact Accepted mutatiOn (CAO) or BLOSUM. We show that this increased performance results both from a refined definition of the contacts and from the use of an improved contact substitution score.

Contact: cedric.notredame@crg.eu

Availability: STRIKE is open-source freeware available from www.tcoffee.org

Supplementary Information: Supplementary data are available at Bioinformatics online.
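To make the idea of scoring an alignment against a single structure concrete, here is a minimal Python sketch. It is not the STRIKE implementation. It assumes that contacts in the reference structure are given as pairs of residue indices, and it uses a toy pair score (hydrophobic pairs and salt bridges rewarded, like charges penalized) in place of the contact substitution score described in the abstract. All function names, the scoring rule and the example sequences are hypothetical.

```python
# Toy residue classes used only for this illustration (an assumption,
# not the contact substitution matrix used by STRIKE).
HYDROPHOBIC = set("AVLIMFWC")
POSITIVE = set("KRH")
NEGATIVE = set("DE")


def toy_pair_score(a, b):
    """Hypothetical score for two residues placed at contacting positions."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1.0
    if (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE):
        return 1.0
    if (a in POSITIVE and b in POSITIVE) or (a in NEGATIVE and b in NEGATIVE):
        return -1.0
    return 0.0


def residue_columns(aligned_seq):
    """Return, for each ungapped residue, the alignment column it occupies."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def contact_score(alignment, ref_name, contacts, pair_score=toy_pair_score):
    """Average pair score over the residues that the other sequences place
    in the columns of each contact of the reference structure."""
    ref_cols = residue_columns(alignment[ref_name])
    total, count = 0.0, 0
    for i, j in contacts:
        ci, cj = ref_cols[i], ref_cols[j]
        for name, seq in alignment.items():
            if name == ref_name:
                continue
            a, b = seq[ci], seq[cj]
            if a == "-" or b == "-":
                continue  # a gapped position carries no contact information
            total += pair_score(a, b)
            count += 1
    return total / count if count else 0.0


# Two alternative alignments of the same two sequences; "ref" is the only
# sequence with a known structure, with contacts between residues 0-3 and 1-2.
contacts = [(0, 3), (1, 2)]
aln_a = {"ref": "LKEV", "seq2": "IRDL"}
aln_b = {"ref": "LKEV-", "seq2": "-IRDL"}

score_a = contact_score(aln_a, "ref", contacts)
score_b = contact_score(aln_b, "ref", contacts)
print("preferred alignment:", "A" if score_a > score_b else "B")
```

In this toy case alignment A is preferred because the residues it places at the contacting positions of the reference are compatible under the assumed pair score, while the shifted alignment B is not.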
