HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment

Sequence-based protein function and structure prediction depends crucially on sequence-search sensitivity and accuracy of the resulting sequence alignments. We present an open-source, general-purpose tool that represents both query and database sequences by profile hidden Markov models (HMMs): 'HMM-HMM–based lightning-fast iterative sequence search' (HHblits; http://toolkit.genzentrum.lmu.de/hhblits/). Compared to the sequence-search tool PSI-BLAST, HHblits is faster owing to its discretized-profile prefilter, has 50–100% higher sensitivity and generates more accurate alignments.

[1]  T. Blundell,et al.  Comparative protein modelling by satisfaction of spatial restraints. , 1993, Journal of molecular biology.

[2]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[3]  Gapped BLAST and PSI-BLAST: A new , 1997 .

[4]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[5]  A. Murzin How far divergent evolution goes in proteins. , 1998, Current opinion in structural biology.

[6]  Ian Holmes,et al.  Dynamic Programming Alignment Accuracy , 1998, J. Comput. Biol..

[7]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[8]  M. Gerstein,et al.  Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins , 2001, Genome Research.

[9]  Poethig Rs,et al.  Life with 25,000 genes. , 2001 .

[10]  J. Skolnick,et al.  TM-align: a protein structure alignment algorithm based on the TM-score , 2005, Nucleic acids research.

[11]  Johannes Söding,et al.  The HHpred interactive server for protein homology detection and structure prediction , 2005, Nucleic Acids Res..

[12]  Johannes Söding,et al.  Protein homology detection by HMM?CHMM comparison , 2005, Bioinform..

[13]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[14]  Michael Farrar,et al.  Sequence analysis Striped Smith – Waterman speeds database searches six times over other SIMD implementations , 2007 .

[15]  Tim J. P. Hubbard,et al.  Data growth and its impact on the SCOP database: new developments , 2007, Nucleic Acids Res..

[16]  Johannes Söding,et al.  De novo identification of highly diverged protein repeats by probabilistic consistency , 2008, Bioinform..

[17]  A. Biegert,et al.  Sequence context-specific profiles for homology searching , 2009, Proceedings of the National Academy of Sciences.

[18]  Jeff A. Bilmes,et al.  Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure , 2011, BMC Bioinformatics.

[19]  W. Pearson,et al.  Homologous over-extension: a challenge for iterative similarity searches , 2010, Nucleic acids research.

[20]  Sven Griep,et al.  PDBselect 1992–2009 and PDBfilter-select , 2009, Nucleic Acids Res..

[21]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[22]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[23]  Johannes Söding,et al.  Protein sequence comparison and fold recognition: progress and good-practice benchmarking. , 2011, Current opinion in structural biology.