Protein domain hierarchy Gibbs sampling strategies

Abstract

Hierarchically arranged multiple sequence alignment profiles are useful for modeling protein domains that have functionally diverged into evolutionarily related subgroups. Currently, such alignment hierarchies are largely constructed through manual curation, as is done for the NCBI Conserved Domain Database (CDD). Recently, however, I developed a Gibbs sampler that uses an approach termed statistical evolutionary dynamics analysis to accomplish this task automatically while also identifying sequence determinants of protein function. Here I describe the statistical model and sampling strategies underlying this sampler. When applied to simulated protein sequences, which conform precisely to the underlying statistical model, these sampling strategies efficiently converge on the hierarchy used to generate the sequences. For real protein sequences, however, the sampler finds alternative, nearly optimal hierarchies for many domains, which indicates a significant degree of ambiguity. I illustrate how both the nature of such ambiguities and the most robust ("consensus") features of a hierarchy may be determined from an ensemble of independently generated hierarchies for the same domain. Such consensus hierarchies can provide stable models of protein domain functional divergence.
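The idea of extracting consensus features from an ensemble of independently generated hierarchies can be illustrated with a small sketch. This is not the procedure used by the sampler described here. It is a simplified analogue of majority-rule consensus, under the assumption that each hierarchy is given as a node-to-parent map plus a sequence-to-node assignment. The function names, the `min_support` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def subtree_sets(parent, assignment):
    """Return the subgroups of one hierarchy as frozensets of sequence ids.

    parent:     dict mapping each node to its parent (the root maps to None).
    assignment: dict mapping each sequence id to the node it is assigned to.
    Both representations are assumptions made for this illustration.
    """
    members = {node: set() for node in parent}
    for seq, node in assignment.items():
        # Walk up to the root so the sequence counts toward every ancestor.
        while node is not None:
            members[node].add(seq)
            node = parent[node]
    return {frozenset(s) for s in members.values() if s}


def consensus_subgroups(ensemble, min_support=0.5):
    """Keep subgroups found in at least min_support of the hierarchies."""
    counts = Counter()
    for parent, assignment in ensemble:
        counts.update(subtree_sets(parent, assignment))
    n = len(ensemble)
    return {group: c / n for group, c in counts.items() if c / n >= min_support}


# Toy ensemble: three hypothetical runs over six sequences "a".."f".
run1 = ({"root": None, "A": "root", "B": "root"},
        {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"})
run2 = ({"root": None, "A": "root", "A2": "A", "B": "root"},
        {"a": "A2", "b": "A2", "c": "A", "d": "B", "e": "B", "f": "B"})
run3 = ({"root": None, "A": "root", "B": "root"},
        {"a": "A", "b": "A", "c": "B", "d": "B", "e": "B", "f": "B"})

ensemble = [run1, run2, run3]
support = consensus_subgroups(ensemble, min_support=0.6)

for group, frac in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(group), round(frac, 2))
```

In this toy ensemble, the placement of sequence "c" differs between runs, so the subgroup containing "c", "d", "e" and "f" appears in only one of the three hierarchies and falls below the threshold, while subgroups recovered in two or more runs are retained. Subgroups with low support point to where the ambiguity lies. Those with high support are candidate consensus features.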
