Efficient alternatives to PSI-BLAST

In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI-BLAST. Both may benefit from knowledge of the amino acid composition specific to a given protein family: SeedBLAST uses a deliberately designed seed, while CTX-PSI-BLAST extends PSI-BLAST with a context-specific substitution model.

The seeding technique has become central to the theory of sequence alignment. There are several efficient tools that apply seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate the superiority of SeedBLAST in terms of efficiency, especially for twilight zone hits.

The contextual substitution model has been shown to increase the sensitivity of protein alignment. In this paper we take the next step in the contextual alignment program. We introduce a contextual version of the PSI-BLAST algorithm, which is itself an iterative version of the NCBI-BLAST tool. An experimental evaluation demonstrates significantly higher sensitivity compared to the ordinary PSI-BLAST algorithm.
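
To make the notion of a subset seed concrete, the following is a minimal sketch in Python, not the SeedBLAST implementation. It assumes that each seed position is a partition of the amino acid alphabet into groups, and that two words form a hit when, at every position, both residues fall into the same group. The groupings `EXACT`, `COARSE`, and `ANY`, and the function names, are hypothetical and chosen only for illustration; they are not the family-specific seeds computed by the evolutionary algorithm, and the naive scan below stands in for the DFA-based hit detection.

```python
from typing import FrozenSet, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical partitions of the alphabet (illustrative only, not from the paper).
Partition = List[FrozenSet[str]]
EXACT: Partition = [frozenset(a) for a in ALPHABET]       # residues must be identical
COARSE: Partition = [frozenset("ILMV"), frozenset("FWY"),  # one coarse level of a
                     frozenset("KRH"), frozenset("DENQ"),  # residue hierarchy
                     frozenset("AGPST"), frozenset("C")]
ANY: Partition = [frozenset(ALPHABET)]                     # "don't care" position


def same_group(partition: Partition, a: str, b: str) -> bool:
    """True if residues a and b belong to the same group of the partition."""
    return any(a in group and b in group for group in partition)


def seed_hits(seed: List[Partition], s: str, t: str) -> bool:
    """True if words s and t (each as long as the seed) match at every position."""
    return len(s) == len(t) == len(seed) and all(
        same_group(p, a, b) for p, a, b in zip(seed, s, t)
    )


def find_hits(seed: List[Partition], query: str, subject: str) -> List[Tuple[int, int]]:
    """Naive quadratic scan for all (query offset, subject offset) seed hits."""
    k = len(seed)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in range(len(subject) - k + 1)
        if seed_hits(seed, query[i:i + k], subject[j:j + k])
    ]


if __name__ == "__main__":
    seed = [EXACT, COARSE, ANY, EXACT]
    print(find_hits(seed, "AIKC", "GGAVWC"))
```

A production tool would avoid the quadratic scan: as the abstract notes, the seeds are compiled into deterministic finite automata so that hits are detected in a single pass over the database sequence.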
