Bayesian nonparametrics in protein remote homology search

MOTIVATION: The wide application of three-dimensional protein structure modeling in biomedical research motivates the development of protein sequence alignment tools that combine high alignment accuracy with sensitivity to remotely homologous proteins. In this paper, we aim to improve the quality of alignments between sequence profiles, which are encoded multiple sequence alignments. To achieve this goal, we model profile contexts, that is, fixed-length profile fragments.

RESULTS: We develop a hierarchical Dirichlet process mixture model to describe the distribution of profile contexts. The model is able to capture dependencies between amino acids in each context position. It represents an attempt at modeling profile fragments at several hierarchical levels, within a profile and among profiles. Even modeling contexts of unit length leads to greater improvements than were obtained previously by processing contexts of length 13. We develop a new profile comparison method, called COMER, that integrates the model. A benchmark against three other profile-to-profile comparison methods shows an increase in both sensitivity and alignment quality.

AVAILABILITY AND IMPLEMENTATION: COMER is open-source software licensed under the GNU GPLv3, available at https://sourceforge.net/projects/comer

CONTACT: mindaugas.margelevicius@bti.vu.lt

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
