Hidden Markov models for remote protein homology detection

Genome sequencing projects are advancing at a staggering pace and produce large amounts of sequence data every day. However, the experimental characterization of the encoded genes and proteins is lagging far behind. Interpretation of genomic sequences therefore relies largely on computational algorithms and on transferring annotation from characterized proteins to related uncharacterized proteins. Detection of evolutionary relationships between sequences, known as protein homology detection, has become one of the main fields of computational biology. Arguably the most successful technique for modeling protein homology is the Hidden Markov Model (HMM), which is based on a probabilistic framework.

This thesis describes improvements to protein homology detection methods. The main part of the work is devoted to profile HMMs, which are used in database searches to identify homologous protein sequences that belong to the same protein family. The key step is model estimation, which aims to create a model that generalizes an often limited and biased training set to the entire protein family, including members that have not yet been observed. This thesis addresses several issues in model estimation: i) prior probability settings, pointing at a conflict between modeling true positives and high discrimination; ii) discriminative training, by proposing an algorithm that adapts model parameters from non-homologous sequences; and iii) key HMM parameters, assessing the relative importance of different aspects of the estimation process, leading to an optimized procedure. Taken together, the work extends our knowledge of theoretical aspects of profile HMMs and can immediately be used for improved protein homology detection by profile HMMs.

If related sequences are highly divergent, standard methods often fail to detect homology.
The superfamily of G protein-coupled receptors (GPCRs) can be divided into families that share almost no sequence similarity, yet have the same seven membrane-spanning topology. It would not be possible to construct a profile HMM that models the entire superfamily. We instead analyzed the GPCR superfamily and found conserved features in the amino acid distributions and lengths of membrane and non-membrane regions. Based on those high-level features we estimated an HMM (GPCRHMM), with the specific goal of detecting remotely related GPCRs. GPCRHMM is, at comparable error rates, much more sensitive than other strategies for GPCR discovery. In a search of five genomes we predicted 120 sequences that lacked previous annotation as possible GPCRs. The majority of these predictions (102) were in C. elegans, but 4 were also found in human and 7 in mouse.

LIST OF PUBLICATIONS

I. Wistrand, M and Sonnhammer, ELL. Transition priors for protein hidden Markov models: an empirical study towards maximum discrimination. Journal of Computational Biology, 2004, 11(1), 181-193.

II. Wistrand, M and Sonnhammer, ELL. Improving profile HMM discrimination by adapting transition prior probabilities. Journal of Molecular Biology, 2004, 338(4), 847-854.

III. Wistrand, M and Sonnhammer, ELL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics, 2005, 6(1):99.

IV. Wistrand, M*, Kall, L* and Sonnhammer, ELL. A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Accepted for publication in Protein Science.

* These authors contributed equally to the project.
