cswHMM: A Novel Context Switching Hidden Markov Model for Biological Sequence Analysis

In this work we created a sequence model that goes beyond simple linear patterns to capture a specific type of higher-order relationship found in biological sequences. In particular, we seek models that can account for partially overlaid and interleaved patterns. Our proposed context-switching model (cswHMM) is a variable-order hidden Markov model (HMM) whose structure allows control to switch between two or more sub-models. Tests of this approach suggest that combining HMMs for protein sequence analysis, such as pattern-mining-based HMMs or profile HMMs, with the context-switching approach can improve the descriptive ability and performance of the models.
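To make the idea of switching control between sub-models concrete, the following is a minimal sketch and not the cswHMM construction itself. It assumes a simplified first-order setting rather than the variable-order model described above: two toy sub-models are joined into one composite HMM in which every state hands control to the other sub-model with a fixed probability. The names `combine`, `forward_likelihood`, the `switch` parameter, the two-letter alphabet, and all probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product


def combine(sub_a, sub_b, switch):
    """Join two sub-models into one composite first-order HMM (illustrative only).

    Each sub-model is a dict with 'init' (start distribution), 'trans'
    (row-stochastic transition matrix) and 'emit' (one symbol->prob dict per state).
    With probability `switch` a state hands control to the other sub-model,
    entering it through that sub-model's start distribution; otherwise it
    follows its own transitions. This is an assumed, simplified switching rule.
    """
    subs = [sub_a, sub_b]
    sizes = [len(s["init"]) for s in subs]
    offsets = [0, sizes[0]]  # position of each sub-model's states in the composite
    n = sum(sizes)
    trans = [[0.0] * n for _ in range(n)]
    emit = []
    for k, sub in enumerate(subs):
        other = 1 - k
        for i in range(sizes[k]):
            row = trans[offsets[k] + i]
            # Stay inside the current sub-model.
            for j in range(sizes[k]):
                row[offsets[k] + j] = (1.0 - switch) * sub["trans"][i][j]
            # Switch context to the other sub-model.
            for j in range(sizes[other]):
                row[offsets[other] + j] = switch * subs[other]["init"][j]
            emit.append(sub["emit"][i])
    # Assumption: either sub-model is equally likely to be in control at the start.
    init = [0.5 * p for s in subs for p in s["init"]]
    return init, trans, emit


def forward_likelihood(init, trans, emit, seq):
    """Standard forward algorithm: probability of `seq` under the composite HMM."""
    states = range(len(init))
    alpha = [init[s] * emit[s].get(seq[0], 0.0) for s in states]
    for sym in seq[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in states) * emit[s].get(sym, 0.0)
            for s in states
        ]
    return sum(alpha)


# Toy sub-models over a hypothetical two-letter alphabet {H, P}.
sub_a = {
    "init": [1.0, 0.0],
    "trans": [[0.1, 0.9], [0.8, 0.2]],
    "emit": [{"H": 0.9, "P": 0.1}, {"H": 0.2, "P": 0.8}],
}
sub_b = {"init": [1.0], "trans": [[1.0]], "emit": [{"H": 0.5, "P": 0.5}]}

init, trans, emit = combine(sub_a, sub_b, switch=0.1)
likelihood = forward_likelihood(init, trans, emit, "HPHP")
# Sanity check: likelihoods of all length-3 sequences should sum to 1.
total = sum(
    forward_likelihood(init, trans, emit, "".join(s)) for s in product("HP", repeat=3)
)
```

Because the composite is itself an ordinary HMM, standard inference such as the forward algorithm applies unchanged; only the block structure of the transition matrix encodes the switching.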
