Protein Fold Recognition Using Segmentation Conditional Random Fields (SCRFs)

Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. We propose segmentation conditional random fields (SCRFs), a conditional graphical model, as an effective solution to this problem. In contrast to traditional graphical models such as the hidden Markov model (HMM), SCRFs follow a discriminative approach. They can therefore include arbitrary features in the model, such as overlapping features or long-range interaction features over the whole sequence. Training optimizes a convex objective function, so the estimated model parameters are globally optimal. In addition, the segmentation setting in SCRFs makes their graphical structures intuitively similar to protein 3-D structures and, more importantly, provides a framework for modeling the long-range interactions between secondary structures directly. Our model is applied to predict the parallel beta-helix fold, an important fold in bacterial pathogenesis and carbohydrate binding/cleavage. Cross-family validation shows that SCRFs not only score all known beta-helices higher than non-beta-helices in the Protein Data Bank (PDB), but also accurately locate rungs in known beta-helix proteins. Our method outperforms BetaWrap, a state-of-the-art algorithm for predicting beta-helix folds, and HMMER, a general motif detection algorithm based on HMMs, and has the additional advantage of being applicable to other protein folds. Applying our prediction model to the UniProt database, we identify previously unknown potential beta-helices.
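
To make the segmentation setting concrete, the following is a minimal sketch of segment-level Viterbi decoding, in which each label is assigned to a variable-length segment rather than to a single residue. It is not the authors' implementation: it assumes a simple chain of adjacent segments (the long-range interactions between segments described above are not modeled), and the names `best_segmentation` and `toy_score`, the labels, and the scoring rule are hypothetical choices made only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
ScoreFn = Callable[[Sequence[str], int, int, Optional[str], str], float]


def best_segmentation(seq: Sequence[str], labels: Sequence[str],
                      max_len: int, score: ScoreFn) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring segmentation of seq under a chain of segments.

    score(seq, start, end, prev_label, label) is a hypothetical segment-level
    feature score; it may inspect the whole segment seq[start:end] at once.
    """
    n = len(seq)
    neg_inf = float("-inf")
    # best[i][y]: best score of a segmentation of seq[:i] whose last segment has label y
    best = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]

    for end in range(1, n + 1):
        for y in labels:
            for length in range(1, min(max_len, end) + 1):
                start = end - length
                if start == 0:
                    # First segment: no previous label.
                    cand = score(seq, 0, end, None, y)
                    if cand > best[end][y]:
                        best[end][y], back[end][y] = cand, (0, None)
                    continue
                for prev in labels:
                    if best[start][prev] == neg_inf:
                        continue
                    cand = best[start][prev] + score(seq, start, end, prev, y)
                    if cand > best[end][y]:
                        best[end][y], back[end][y] = cand, (start, prev)

    # Trace back from the best final label.
    label = max(labels, key=lambda y: best[n][y])
    total = best[n][label]
    segments: List[Segment] = []
    end = n
    while end > 0:
        start, prev = back[end][label]
        segments.append((start, end, label))
        end, label = start, prev
    segments.reverse()
    return total, segments


def toy_score(seq, start, end, prev_label, label):
    # Hypothetical rule: +1 per symbol matching the label, -1 otherwise,
    # and a fixed cost of 0.5 per segment to discourage over-segmentation.
    match = sum(1 if ch.upper() == label else -1 for ch in seq[start:end])
    return match - 0.5


total, segments = best_segmentation("aaabbba", ["A", "B"], max_len=4, score=toy_score)
```

Because the score function receives the whole segment, it can use features such as segment length or composition, which is the kind of flexibility a per-residue HMM emission does not offer directly.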
