MRFy: Remote Homology Detection for Beta-Structural Proteins Using Markov Random Fields and Stochastic Search

We introduce MRFy, a tool for protein remote homology detection that captures beta-strand dependencies in a Markov random field. Over a set of 11 SCOP beta-structural superfamilies, MRFy shows a 14 percent improvement in mean Area Under the Curve (AUC) for the motif recognition problem compared to HMMER, a 25 percent improvement compared to RAPTOR, a 14 percent improvement compared to HHPred, and an 18 percent improvement compared to CNFPred and RaptorX. MRFy is implemented in the Haskell functional programming language and parallelizes well on multi-core systems. MRFy is available, both as source code and as an executable, from http://mrfy.cs.tufts.edu/.
