Protein secondary structure pattern discovery and its application in secondary structure prediction

A method for discovering protein secondary structure patterns is presented. The TEIRESIAS algorithm is extended to discover these patterns, and a secondary structure pattern dictionary is built for each of four organisms. Both the distribution of patterns and the structure of the common patterns differ between dictionaries. Proteins of different organisms represent different biological languages. Based on the organism-specific dictionary, a hidden Markov model is built to predict protein secondary structure. Dictionary-based prediction was tested on four organisms and compared with the Profile network from HeiDelberg (PHD) method. The experimental results show that our prediction method outperforms the PHD method as assessed by the modified segment overlap (SOV) measure.
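The idea of a pattern dictionary can be illustrated with a minimal sketch. It is not the TEIRESIAS algorithm, which also handles wildcard positions and reports maximal patterns. It only counts fixed-length contiguous substrings that occur in at least a given number of sequences, which is the notion of support that pattern discovery relies on. The function name `frequent_substrings`, the H/E/C labels, and the toy strings are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def frequent_substrings(sequences, length, min_support):
    """Return {substring: number of sequences containing it} for all
    substrings of the given length found in at least min_support sequences.

    Simplified stand-in for pattern discovery: no wildcards, no maximality.
    """
    support = Counter()
    for seq in sequences:
        # A set makes each sequence count at most once per substring.
        seen = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {p: n for p, n in support.items() if n >= min_support}


# Hypothetical secondary-structure strings (H = helix, E = strand, C = coil).
toy = ["CCHHHHCC", "CHHHHEEC", "CCEEEECC"]
dictionary = frequent_substrings(toy, length=4, min_support=2)
print(dictionary)
```

Building one such dictionary per organism and comparing their contents would be one simple way to examine how pattern distributions differ between organisms.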
