Validation of Coevolving Residue Algorithms via Pipeline Sensitivity Analysis: ELSC and OMES and ZNMI, Oh My!

Correlated amino acid substitution algorithms attempt to discover groups of residues that co-fluctuate because of structural or functional constraints. Although these algorithms could inform both ab initio protein folding calculations and evolutionary studies, their utility for these purposes has been hindered by a lack of confidence in their predictions, which stems from hard-to-control sources of error. To complicate matters further, new users must choose among a multitude of methods and also handle the mechanics of assembling and pruning a dataset. We first introduce a new pair scoring method, called ZNMI (Z-scored-product Normalized Mutual Information), which drastically improves the performance of mutual information for co-fluctuating residue prediction. Second, and more importantly, we recast the process of finding coevolving residues in proteins as a data-processing pipeline inspired by the medical imaging literature. We construct an ensemble of alignment partitions that can be used in a cross-validation scheme to assess how choices made during the procedure affect the resulting predictions. This pipeline sensitivity study yields a measure of reproducibility (how similar are the predictions when the pipeline is perturbed?) and a measure of accuracy (are residue pairs with large couplings, on average, close in tertiary structure?). We select a handful of published methods, along with ZNMI, and compare their reproducibility and accuracy on three diverse protein families. We find that (i) none of the algorithms tested appears to be both highly reproducible and accurate, although ZNMI is by far one of the most accurate, and (ii) while users should be wary of predictions drawn from a single alignment, considering an ensemble of sub-alignments can help to identify couplings that are both highly accurate and reproducible. Our cross-validation approach should be of interest to both developers and end users of algorithms that try to detect correlated amino acid substitutions.
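To make the reproducibility idea concrete, the following is a minimal sketch rather than the method of this paper. It scores column pairs with plain normalized mutual information (mutual information divided by joint entropy), which is an assumed stand-in: the Z-scoring step that defines ZNMI is not specified in this abstract and is not reproduced here. It then splits a toy alignment into random halves and reports the mean Jaccard overlap of the top-k pairs from each half. The function names, the split-half scheme, the choice of Jaccard overlap, and the toy sequences are all illustrative assumptions.

```python
import math
import random
from collections import Counter
from itertools import combinations


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def nmi(col_i, col_j):
    """Mutual information of two alignment columns divided by their joint entropy.

    Assumed normalization for illustration; this is NOT the ZNMI score.
    """
    h_joint = entropy(list(zip(col_i, col_j)))
    if h_joint == 0.0:  # both columns are fully conserved
        return 0.0
    return (entropy(col_i) + entropy(col_j) - h_joint) / h_joint


def score_pairs(alignment):
    """Score every pair of columns in a list of equal-length sequences."""
    columns = list(zip(*alignment))
    return {(i, j): nmi(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)}


def top_pairs(scores, k):
    """Return the k highest-scoring column pairs as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def split_half_reproducibility(alignment, k, n_splits=20, seed=0):
    """Mean Jaccard overlap of top-k pairs between random halves of the alignment."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(n_splits):
        shuffled = list(alignment)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        a = top_pairs(score_pairs(shuffled[:half]), k)
        b = top_pairs(score_pairs(shuffled[half:]), k)
        overlaps.append(len(a & b) / len(a | b))
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 0 and 1 covary, column 2 is conserved.
    toy = ["AKGL", "AKGV", "CEGL", "CEGV", "AKGV", "CEGL", "AKGL", "CEGV"]
    print(split_half_reproducibility(toy, k=1))
```

An accuracy measure in the sense described above would additionally require residue-residue distances from a solved structure, which this sketch does not include.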
