Rapid similarity search of proteins using alignments of domain arrangements

MOTIVATION: Homology search methods are dominated by the central paradigm that sequence similarity is a proxy for common ancestry and, by extension, functional similarity. To determine sequence similarity in proteins, the most widely used methods rely on models of sequence evolution and compare amino-acid strings in search of conserved linear stretches. Probabilistic models or sequence profiles capture the position-specific variation in an alignment of homologous sequences and can identify conserved motifs or domains. While profile-based search methods are generally more accurate than simple sequence comparison methods, they tend to be computationally more demanding. In recent years, several methods have emerged that perform protein similarity searches based on domain composition. However, few methods have considered the linear arrangement of domains when conducting similarity searches, despite strong evidence that domain order can harbour considerable functional and evolutionary signal.

RESULTS: Here, we introduce an alignment scheme that uses a classical dynamic programming approach to the global alignment of domains. We illustrate that representing proteins as strings of domains (domain arrangements) and comparing these strings globally allows for a homology search that is both fast and sensitive. Further, we demonstrate that the presented methods complement existing methods by finding similar proteins missed by popular amino-acid-based comparison methods.

AVAILABILITY: An implementation of the presented algorithms, a web-based interface and a command-line program for batch searching against the UniProt database can be found at http://rads.uni-muenster.de. Furthermore, we provide a Java API for programmatic access to domain-string–based search methods.
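To make the idea of globally aligning domain arrangements concrete, the following is a minimal sketch of a classical Needleman-Wunsch alignment applied to lists of domain labels instead of amino acids. It is not the implementation distributed at the URL above: the function name `align_domains`, the domain labels, and the match, mismatch and gap scores are illustrative assumptions chosen only to keep the example small.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_domains(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed score for identical domains
    mismatch: int = -1,  # assumed score for different domains
    gap: int = -1,       # assumed linear gap penalty
) -> Tuple[int, List[Pair]]:
    """Globally align two domain arrangements (Needleman-Wunsch)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for the domains ending prefixes a[:i], b[:j].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two domains
                score[i - 1][j] + gap,            # domain in a vs. gap
                score[i][j - 1] + gap,            # gap vs. domain in b
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


# Hypothetical arrangements: the query carries one extra N-terminal domain.
query = ["SH3", "SH2", "Pkinase"]
target = ["SH2", "Pkinase"]
total, pairs = align_domains(query, target)
print(total)  # 3
print(pairs)  # [('SH3', None), ('SH2', 'SH2'), ('Pkinase', 'Pkinase')]
```

Because a protein typically contains only a handful of domains, the dynamic programming table is far smaller than for an amino-acid alignment of the same two proteins, which is one plausible reason a domain-level comparison can be fast. A realistic scoring scheme would likely score similarity between related domain families rather than treating every mismatch equally; the flat mismatch score here is a simplification.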
