HH-suite3 for fast remote homology detection and deep protein annotation

Background HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous sequences. Results We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. This accelerated HHsearch by a factor 4 and HHblits by a factor 2 over the previous version 2.0.16. HHblits3 is ~10× faster than PSI-BLAST and ~20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over servers in a cluster using OpenMP and message passing interface (MPI). The free, open-source, GNU GPL(v3)-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

[1]  Johannes Söding,et al.  Protein homology detection by HMM?CHMM comparison , 2005, Bioinform..

[2]  Timothy P. Levine,et al.  Using HHsearch to tackle proteins of unknown function: A pilot study with PH domains , 2016, Traffic.

[3]  Tim J. P. Hubbard,et al.  Data growth and its impact on the SCOP database: new developments , 2007, Nucleic Acids Res..

[4]  N. Grishin,et al.  COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. , 2003, Journal of molecular biology.

[5]  Sean R. Eddy,et al.  Accelerated Profile HMM Searches , 2011, PLoS Comput. Biol..

[6]  Giorgio Valle,et al.  CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment , 2008, BMC Bioinformatics.

[7]  S. Tringe,et al.  Tackling soil diversity with the assembly of large, complex metagenomes , 2014, Proceedings of the National Academy of Sciences.

[8]  Christophe Dessimoz,et al.  SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and ×86/SSE2 , 2008, BMC Research Notes.

[9]  M. Frith,et al.  Adaptive seeds tame genomic sequence comparison. , 2011, Genome research.

[10]  Torbjørn Rognes,et al.  Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors , 2000, Bioinform..

[11]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[12]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[13]  K. Dill,et al.  The Protein-Folding Problem, 50 Years On , 2012, Science.

[14]  Sean R Eddy,et al.  A new generation of homology search tools based on probabilistic inference. , 2009, Genome informatics. International Conference on Genome Informatics.

[15]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[16]  Johannes Söding,et al.  Discriminative modelling of context-specific amino acid substitution probabilities , 2012, Bioinform..

[17]  Jun'ichi Tsujii,et al.  Corpus annotation for mining biomedical events from literature , 2008, BMC Bioinformatics.

[18]  A. Godzik,et al.  Comparison of sequence profiles. Strategies for structural predictions using sequence information , 2008, Protein science : a publication of the Protein Society.

[19]  Johannes Söding,et al.  Protein sequence comparison and fold recognition: progress and good-practice benchmarking. , 2011, Current opinion in structural biology.

[20]  Silvio C. E. Tosatto,et al.  The Pfam protein families database in 2019 , 2018, Nucleic Acids Res..

[21]  Yongchao Liu,et al.  CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units , 2009, BMC Research Notes.

[22]  Silvio C. E. Tosatto,et al.  InterPro in 2019: improving coverage, classification and access to protein sequence annotations , 2018, Nucleic Acids Res..

[23]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[24]  Marco Biasini,et al.  SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information , 2014, Nucleic Acids Res..

[25]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[26]  Torbjørn Rognes,et al.  Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation , 2011, BMC Bioinformatics.

[27]  Andrzej Wozniak,et al.  Using video-oriented instructions to speed up sequence comparison , 1997, Comput. Appl. Biosci..

[28]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[29]  Jennifer A. Doudna,et al.  New CRISPR-Cas systems from uncultivated microbes , 2016, Nature.

[30]  Mindaugas Margelevicius,et al.  Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison , 2010, BMC Bioinformatics.

[31]  Chao Xie,et al.  Fast and sensitive protein alignment using DIAMOND , 2014, Nature Methods.

[32]  Johannes Söding,et al.  De novo identification of highly diverged protein repeats by probabilistic consistency , 2008, Bioinform..

[33]  Michael Farrar,et al.  Sequence analysis Striped Smith – Waterman speeds database searches six times over other SIMD implementations , 2007 .

[34]  Kevin Truong,et al.  160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA) , 2007, BMC Bioinformatics.

[35]  Wei Zhang,et al.  SP5: Improving Protein Fold Recognition by Using Torsion Angle Profiles and Profile-Based Gap Penalty Model , 2008, PloS one.

[36]  Johannes Söding,et al.  MMseqs2: sensitive protein sequence searching for the analysis of massive data sets , 2017, bioRxiv.

[37]  Maria Jesus Martin,et al.  Uniclust databases of clustered and deeply annotated protein sequences and alignments , 2016, Nucleic Acids Res..