Protein classification using Hidden Markov models and randomised decision trees

Since the introduction of next generation sequencing there has been a demand for sophisticated methods to classify proteins based on sequence data. The two main approaches to this task are to use the raw sequence data and align them against other sequences, or to extract discrete high-level features from the protein sequences and compare the features. Two machine learning methods are demonstrated, one for each approach. Profile Hidden Markov Models are built from multiple alignments of raw sequence data and learn amino acid emission and transition parameters for a given alignment, which effectively harnesses the power of aligning a test protein to a model built from many proteins. Random Forests, on the other hand, are used to discriminate between two sets of proteins based on features such as functional amino acid groups and physicochemical properties extracted from the raw sequences. The strengths and limitations of each method are presented and discussed, focussing on their individual merits and how they could complement each other rather than comparing them by classification accuracy alone.
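As an illustration of the feature-based approach, the following minimal sketch turns a raw sequence into a small numeric feature vector of the kind a Random Forest could be trained on. The residue grouping and the function name `extract_features` are assumptions made for this example and are not taken from the work itself; the hydropathy values are those of the Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical functional grouping, chosen only for illustration.
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def extract_features(sequence):
    """Return group fractions and mean hydropathy for a protein sequence.

    Characters that are not one of the 20 standard residues are ignored.
    """
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    features = {
        f"frac_{name}": sum(r in members for r in residues) / n
        for name, members in GROUPS.items()
    }
    features["mean_hydropathy"] = sum(KYTE_DOOLITTLE[r] for r in residues) / n
    return features


if __name__ == "__main__":
    print(extract_features("MKTLLVAGDE"))
```

Each protein then becomes one fixed-length row of numbers regardless of its sequence length, which is what allows a tree-based classifier to compare proteins without aligning them.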
