A hierarchical Bayesian model for predicting the functional consequences of amino‐acid polymorphisms

Summary.  Genetic polymorphisms in deoxyribonucleic acid coding regions may have a phenotypic effect on the carrier, e.g. by influencing susceptibility to disease. Detection of deleterious mutations via association studies is hampered by the large number of candidate sites; methods are therefore needed to narrow the search to the most promising sites. One possible approach is to use structural and sequence‐based information on the encoded protein to predict whether a mutation at a particular site is likely to disrupt the functionality of the protein. We propose a hierarchical Bayesian multivariate adaptive regression spline (BMARS) model for supervised learning in this context and assess its predictive performance by using data from mutagenesis experiments on lac repressor and lysozyme proteins. In these experiments, about 12 amino‐acid substitutions were performed at each native amino‐acid position and the effect on protein functionality was assessed. The training data thus consist of repeated observations at each position, a clustering that the hierarchical framework is needed to account for. The model is trained on the lac repressor data and tested on the lysozyme mutations, and vice versa. In particular, we show that the hierarchical BMARS model, by allowing for the clustered nature of the data, yields lower out‐of‐sample misclassification rates than a non‐hierarchical BMARS model, a frequentist MARS model, a support vector machine classifier and an optimally pruned classification tree.
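The train-on-one-protein, test-on-the-other protocol described above can be sketched as follows. This is a minimal illustration only: the `fit` interface, the function names and the majority-class stand-in classifier are hypothetical and are not the hierarchical BMARS model, and the toy feature values and labels are invented for the example and are not taken from the mutagenesis experiments.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A data set is (feature rows, binary labels); 1 = substitution affects function.
Dataset = Tuple[List[List[float]], List[int]]
Predictor = Callable[[Sequence[float]], int]


def misclassification_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of test cases whose predicted label differs from the truth."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_protein_eval(
    datasets: Dict[str, Dataset],
    fit: Callable[[List[List[float]], List[int]], Predictor],
) -> Dict[Tuple[str, str], float]:
    """Train on each protein in turn and test on every other protein."""
    results: Dict[Tuple[str, str], float] = {}
    for train_name, (x_train, y_train) in datasets.items():
        predict = fit(x_train, y_train)
        for test_name, (x_test, y_test) in datasets.items():
            if test_name == train_name:
                continue  # out-of-sample only: never test on the training protein
            y_hat = [predict(row) for row in x_test]
            results[(train_name, test_name)] = misclassification_rate(y_test, y_hat)
    return results


def fit_majority(x: List[List[float]], y: List[int]) -> Predictor:
    """Hypothetical stand-in classifier: always predict the majority training label."""
    majority = 1 if 2 * sum(y) >= len(y) else 0
    return lambda row: majority


# Invented toy data, for illustration only.
toy = {
    "lac": ([[0.1], [0.9], [0.8]], [0, 1, 1]),
    "lysozyme": ([[0.2], [0.7], [0.3], [0.6]], [0, 1, 0, 0]),
}
results = cross_protein_eval(toy, fit_majority)
```

Any classifier that returns a per-mutation prediction function can be passed as `fit`, so competing methods can be compared on the same out-of-sample error measure.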
