Fuzzy Profile Hidden Markov Models for Protein Sequence Analysis

Profile HMMs, which are based on classical hidden Markov models, have been widely applied to the alignment and classification of protein sequence families. The forward and backward variables in profile HMMs are formulated under the statistical independence assumption of probability theory. We propose a fuzzy profile hidden Markov model to overcome the limitations of this assumption. Because protein structures involve strong correlations and sequence preferences, models based on a fuzzy architecture are suitable candidates for building profiles of a given family, since fuzzy sets can handle uncertainties better than classical methods. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures through Choquet integrals, and this formulation is extended to a fuzzy Baum-Welch parameter estimation algorithm for profiles. The model was built and tested on the widely studied globin and kinase family sequences, and its performance was compared with that of the classical HMM. A comparative analysis based on log-likelihood (LL) scores of sequences and receiver operating characteristic (ROC) curves demonstrates the superiority of fuzzy profile HMMs over the classical profile model.
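
To make the central idea more concrete, the following is a minimal Python sketch of a discrete Choquet integral taken with respect to a Sugeno lambda-measure. It is an illustration under stated assumptions, not the formulation used in this work: the function names `sugeno_lambda` and `choquet_integral`, the bisection solver, and the example inputs are all hypothetical. The sketch shows that when the fuzzy densities sum to 1 the measure is additive and the integral reduces to the ordinary weighted sum that appears in a classical forward recursion, whereas other densities give a non-additive aggregation.

```python
import math


def sugeno_lambda(densities, iters=200):
    """Solve prod(1 + lam * g_i) = 1 + lam for the Sugeno parameter lam > -1.

    Assumption: at least two densities, each strictly between 0 and 1.
    The equation is solved in the log domain by plain bisection.
    """
    if len(densities) < 2 or not all(0.0 < g < 1.0 for g in densities):
        raise ValueError("need at least two densities, each in (0, 1)")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities are already additive

    def h(lam):
        # log of prod(1 + lam * g_i) minus log of (1 + lam); zero at the root
        return sum(math.log1p(lam * g) for g in densities) - math.log1p(lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-12, -1e-12  # the root lies in (-1, 0)
    else:
        lo, hi = 1e-12, 1.0  # the root is positive; grow hi until bracketed
        while h(hi) <= 0.0:
            hi *= 2.0
    lo_positive = h(lo) > 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (h(mid) > 0.0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def choquet_integral(values, densities):
    """Discrete Choquet integral of non-negative values w.r.t. a Sugeno measure."""
    lam = sugeno_lambda(densities)
    # Visit the elements from the largest value to the smallest.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = 0.0
    measure = 0.0  # measure of the set of elements visited so far
    for rank, i in enumerate(order):
        g = densities[i]
        # Sugeno rule for adding one element to a disjoint set.
        measure = g + measure + lam * g * measure
        next_value = values[order[rank + 1]] if rank + 1 < len(order) else 0.0
        result += (values[i] - next_value) * measure
    return result
```

In a fuzzy forward recursion, one would presumably replace the sum over predecessor states with an aggregation of this kind; how the densities are derived from the model parameters is specific to the method and is not shown here.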
