Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity.

We find that the degree to which missense variants impair protein function is predictable by comparative sequence analysis alone. Predictions are not confined to a binary distinction between normal and deleterious variants, but extend continuously from mild to severe effects. Prediction accuracy depends strongly on sequence variation and is highest when diverse orthologs are available. High predictive accuracy is achieved by quantifying the physicochemical characteristics of each position of the protein, based on observed evolutionary variation. The strong relationship between the physicochemical characteristics of a missense variant and impairment of protein function extends to human disease. Using four diverse proteins for which sufficient comparative sequence data are available, we show that grades of disease, or likelihood of developing cancer, correlate strongly with physicochemical constraint violation by causative amino acid variants.
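
To make the idea of physicochemical constraint violation concrete, the following is a minimal sketch, not the scoring used in this work. It assumes a single property (the Kyte-Doolittle hydropathy scale), treats the range of values observed among orthologs at one alignment position as the tolerated range, and reports how far a variant falls outside that range. The function name `constraint_violation` and the unweighted min/max range are illustrative assumptions; a realistic analysis would combine several properties and weight sequences by their phylogenetic relatedness.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def constraint_violation(column, variant):
    """Distance by which `variant` lies outside the hydropathy range
    observed in `column` (hypothetical helper, simplified scoring).

    column  -- residues seen at one alignment position across orthologs
    variant -- the substituted amino acid (one-letter code)
    Returns 0.0 when the variant lies within the observed range.
    """
    # Ignore gaps and non-standard symbols in the alignment column.
    observed = [KD_HYDROPATHY[r] for r in column if r in KD_HYDROPATHY]
    if not observed:
        raise ValueError("column contains no standard residues")
    if variant not in KD_HYDROPATHY:
        raise ValueError(f"unknown amino acid: {variant!r}")

    low, high = min(observed), max(observed)
    value = KD_HYDROPATHY[variant]
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


if __name__ == "__main__":
    # A position that is hydrophobic in every ortholog.
    column = "IIVLL"
    for aa in ("V", "A", "D"):
        print(aa, round(constraint_violation(column, aa), 2))
```

Under these assumptions the score is continuous rather than binary: a substitution that stays within the observed range scores zero, a moderately different residue scores low, and a strongly dissimilar residue scores high. The score is also only informative when the column samples diverse orthologs, since a column drawn from near-identical sequences yields a narrow range regardless of the true constraint.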
