On Subset Seeds for Protein Alignment

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is how to design seed alphabets that yield seeds with optimal sensitivity/selectivity trade-offs. We propose several design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. Although the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform large-scale benchmarking of our seeds against several major databases of protein alignments. Here again, the results show comparable or better performance of our seeds versus BLASTP.
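To make the idea of a subset seed concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each seed letter is a partition of the 20 amino acids into groups, and that a letter matches an aligned pair of residues when both fall in the same group. The letter symbols `#`, `@`, `_`, the residue groups, and the function names are all hypothetical and chosen only for illustration; they are not the alphabets designed in the paper.

```python
# Illustrative sketch only: letters, groups and names are assumptions.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

def make_letter(groups):
    """Build a residue -> group-id table; ungrouped residues get singleton groups."""
    table = {}
    for gid, group in enumerate(groups):
        for aa in group:
            table[aa] = gid
    next_id = len(groups)
    for aa in AMINO:
        if aa not in table:
            table[aa] = next_id
            next_id += 1
    return table

# Hypothetical seed alphabet, ordered from most to least selective.
LETTERS = {
    "#": make_letter([]),                                       # identical residues only
    "@": make_letter(["ILMV", "FWY", "KR", "DE", "ST", "NQ"]),  # made-up similarity groups
    "_": make_letter([AMINO]),                                  # joker: any pair matches
}

def seed_key(seed, word):
    """Key of a word under a seed: one group id per seed position."""
    return tuple(LETTERS[s][aa] for s, aa in zip(seed, word))

def seed_hits(seed, a, b):
    """Offsets i where the seed matches the ungapped pair a[i:i+k], b[i:i+k]."""
    k = len(seed)
    n = min(len(a), len(b))
    return [i for i in range(n - k + 1)
            if seed_key(seed, a[i:i + k]) == seed_key(seed, b[i:i + k])]

print(seed_hits("#@_#", "MKLDAW", "QKIEAW"))
```

In this sketch two words hit exactly when their keys are equal, so each word maps to a single index key. That is one plausible reading of "less costly to implement": a cumulative criterion such as the BLASTP one sums substitution scores over the word and compares the total to a threshold, which does not reduce to a per-position test.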

[1] Igor F. Tsigelny. Protein Structure Prediction: Bioinformatic Approach, 2002.

[2] Daniel G. Brown, et al. Optimizing multiple seeds for protein homology search, 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3] Bin Ma, et al. Optimizing Multiple Spaced Seeds for Homology Search, 2004, CPM.

[4] Jeremy Buhler, et al. Designing seeds for similarity search in genomic DNA, 2003, RECOMB '03.

[5] Gregory Kucherov, et al. A unifying framework for seed sensitivity and its application to subset seeds, 2006, J. Bioinform. Comput. Biol.

[6] K. Mizuguchi, et al. Protein Fold Recognition and Comparative Modelling using HOMSTRAD, JOY and FUGUE, 2004.

[7] Jun Wang, et al. Reduction of protein sequence complexity by residue grouping, 2003, Protein engineering.

[8] R. Levy, et al. Simplified amino acid alphabets for protein fold recognition and implications for folding, 2000, Protein engineering.

[9] Kenji Mizuguchi, et al. HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database, 2004, Nucleic Acids Res.

[10] Gajendra P. S. Raghava, et al. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy, 2003, BMC Bioinformatics.

[11] Robert D. Finn, et al. Pfam: clans, web tools and services, 2005, Nucleic Acids Res.

[12] Kun-Mao Chao, et al. Efficient methods for generating optimal single and multiple spaced seeds, 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[13] S. Henikoff, et al. Automated assembly of protein blocks for database searching, 1991, Nucleic acids research.

[14] Daniel G. Brown, et al. Vector seeds: An extension to spaced seeds, 2005, J. Comput. Syst. Sci.

[15] Bin Ma, et al. Seed Optimization Is No Easier than Optimal Golomb Ruler Design, 2007, APBC.

[16] Peer Bork, et al. SMART 5: domains in the context of genomes and networks, 2005, Nucleic Acids Res.

[17] Dominique Lavenier, et al. Optimal neighborhood indexing for protein similarity search, 2008, BMC Bioinformatics.

[18] Michael Kaufmann, et al. BMC Bioinformatics BioMed Central, 2005.

[19] Bin Ma, et al. PatternHunter: faster and more sensitive homology search, 2002, Bioinform.

[20] Dominique Lavenier, et al. Speeding up subset seed algorithm for intensive protein sequence comparison, 2008, 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies.

[21] Bin Ma, et al. tPatternHunter: gapped, fast and sensitive translated homology search, 2005, Bioinform.

[22] Lucian Ilie, et al. Long spaced seeds for finding similarities between biological sequences, 2007, BIOCOMP.

[23] Gregory Kucherov, et al. YASS: enhancing the sensitivity of DNA similarity search, 2005, Nucleic Acids Res.

[24] S. Henikoff, et al. Amino acid substitution matrices from protein blocks, 1992, Proceedings of the National Academy of Sciences of the United States of America.

[25] Robert C. Edgar, et al. MUSCLE: multiple sequence alignment with high accuracy and high throughput, 2004, Nucleic acids research.

[26] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[27] Yin-Feng Xu, et al. Constrained Independence System and Triangulations of Planar Point Sets, 1995, COCOON.

[28] Bin Ma, et al. PatternHunter II: Highly Sensitive and Fast Homology Search, 2004, J. Bioinform. Comput. Biol.

[29] Dominique Lavenier, et al. Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware, 2007, PPAM.

[30] Leming Zhou, et al. Universal seeds for cDNA-to-genome comparison, 2007, BMC Bioinformatics.

[31] Bin Ma, et al. Rapid Homology Search with Neighbor Seeds, 2007, Algorithmica.

[32] Gary Benson, et al. Indel seeds for homology search, 2006, ISMB.

[33] Alejandro A. Schäffer, et al. Improved BLAST searches using longer words for protein seeding, 2007, Bioinform.

[34] Jeremy Buhler, et al. Designing multiple simultaneous seeds for DNA similarity search, 2004, J. Comput. Biol.

[35] Gregory Kucherov, et al. Improved hit criteria for DNA local alignment, 2004, BMC Bioinformatics.

[36] Olivier Poch, et al. BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations, 2001, Nucleic Acids Res.

[37] W. J. Kent, et al. BLAT--the BLAST-like alignment tool, 2002, Genome research.

[38] Bin Ma, et al. On spaced seeds for similarity search, 2004, Discret. Appl. Math.