Paper information - On Subset Seeds for Protein Alignment

On Subset Seeds for Protein Alignment

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds perform as well as or even better than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several major databases of protein alignments. Here again, our seeds perform comparably to or better than BLASTP.
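To make the notion of a subset seed concrete, here is a minimal Python sketch. It rests on assumptions that are not taken from the abstract: each seed letter is modeled as a partition of the amino acids into groups, a letter accepts an aligned pair when both residues fall in the same group, and a seed hits an ungapped alignment at a position when every letter accepts the pair it faces. The letter symbols (`#`, `@`, `_`), the residue groups, and the function names are hypothetical and are not the alphabets designed in the paper.

```python
from typing import Dict, FrozenSet, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed alphabet: each letter is a partition of the amino acids.
# '#' accepts identical residues only, '@' accepts residues from the same
# illustrative group, and '_' accepts any pair (a "don't care" position).
ALPHABET: Dict[str, List[FrozenSet[str]]] = {
    "#": [frozenset(c) for c in AMINO_ACIDS],
    "@": [frozenset(g) for g in ("ILMV", "FWY", "KR", "DE", "NQ", "ST",
                                 "A", "C", "G", "H", "P")],
    "_": [frozenset(AMINO_ACIDS)],
}


def letter_accepts(letter: str, a: str, b: str) -> bool:
    """True if residues a and b lie in the same group of the letter's partition."""
    return any(a in group and b in group for group in ALPHABET[letter])


def seed_hits(seed: str, s: str, t: str) -> List[int]:
    """Return start positions where the seed matches the ungapped alignment of s and t."""
    if len(s) != len(t):
        raise ValueError("ungapped alignment requires sequences of equal length")
    k = len(seed)
    return [
        i
        for i in range(len(s) - k + 1)
        if all(letter_accepts(seed[j], s[i + j], t[i + j]) for j in range(k))
    ]


if __name__ == "__main__":
    s, t = "MKVLAD", "MRILAE"
    print(seed_hits("##", s, t))    # exact-match seed of span 2
    print(seed_hits("#@@#", s, t))  # mixed seed: identity at the ends, groups inside
```

Unlike the cumulative principle used in BLASTP, each position here is tested independently and no score is summed across the seed, which is why this formalism is cheaper to implement but less expressive.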

[1] Igor F. Tsigelny. Protein Structure Prediction: Bioinformatic Approach, 2002.

[2] Daniel G. Brown, et al. Optimizing multiple seeds for protein homology search, 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3] Bin Ma, et al. Optimizing Multiple Spaced Seeds for Homology Search, 2004, CPM.

[4] Jeremy Buhler, et al. Designing seeds for similarity search in genomic DNA, 2003, RECOMB '03.

[5] Gregory Kucherov, et al. A unifying framework for seed sensitivity and its application to subset seeds, 2006, J. Bioinform. Comput. Biol.

[6] K. Mizuguchi, et al. Protein Fold Recognition and Comparative Modelling using HOMSTRAD, JOY and FUGUE, 2004.

[7] Jun Wang, et al. Reduction of protein sequence complexity by residue grouping, 2003, Protein Engineering.

[8] R. Levy, et al. Simplified amino acid alphabets for protein fold recognition and implications for folding, 2000, Protein Engineering.

[9] Kenji Mizuguchi, et al. HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database, 2004, Nucleic Acids Res.

[10] Gajendra P. S. Raghava, et al. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy, 2003, BMC Bioinformatics.

[11] Robert D. Finn, et al. Pfam: clans, web tools and services, 2005, Nucleic Acids Res.

[12] Kun-Mao Chao, et al. Efficient methods for generating optimal single and multiple spaced seeds, 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[13] S. Henikoff, et al. Automated assembly of protein blocks for database searching, 1991, Nucleic Acids Research.

[14] Daniel G. Brown, et al. Vector seeds: An extension to spaced seeds, 2005, J. Comput. Syst. Sci.

[15] Bin Ma, et al. Seed Optimization Is No Easier than Optimal Golomb Ruler Design, 2007, APBC.

[16] Peer Bork, et al. SMART 5: domains in the context of genomes and networks, 2005, Nucleic Acids Res.

[17] Dominique Lavenier, et al. Optimal neighborhood indexing for protein similarity search, 2008, BMC Bioinformatics.

[18] Michael Kaufmann, et al. BMC Bioinformatics BioMed Central, 2005.

[19] Bin Ma, et al. PatternHunter: faster and more sensitive homology search, 2002, Bioinform.

[20] Dominique Lavenier, et al. Speeding up subset seed algorithm for intensive protein sequence comparison, 2008, 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies.

[21] Bin Ma, et al. tPatternHunter: gapped, fast and sensitive translated homology search, 2005, Bioinform.

[22] Lucian Ilie, et al. Long spaced seeds for finding similarities between biological sequences, 2007, BIOCOMP.

[23] Gregory Kucherov, et al. YASS: enhancing the sensitivity of DNA similarity search, 2005, Nucleic Acids Res.

[24] S. Henikoff, et al. Amino acid substitution matrices from protein blocks, 1992, Proceedings of the National Academy of Sciences of the United States of America.

[25] Robert C. Edgar, et al. MUSCLE: multiple sequence alignment with high accuracy and high throughput, 2004, Nucleic Acids Research.

[26] E. Myers, et al. Basic local alignment search tool, 1990, Journal of Molecular Biology.

[27] Yin-Feng Xu, et al. Constrained Independence System and Triangulations of Planar Point Sets, 1995, COCOON.

[28] Bin Ma, et al. PatternHunter II: Highly Sensitive and Fast Homology Search, 2004, J. Bioinform. Comput. Biol.

[29] Dominique Lavenier, et al. Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware, 2007, PPAM.

[30] Leming Zhou, et al. Universal seeds for cDNA-to-genome comparison, 2007, BMC Bioinformatics.

[31] Bin Ma, et al. Rapid Homology Search with Neighbor Seeds, 2007, Algorithmica.

[32] Gary Benson, et al. Indel seeds for homology search, 2006, ISMB.

[33] Alejandro A. Schäffer, et al. Improved BLAST searches using longer words for protein seeding, 2007, Bioinform.

[34] Jeremy Buhler, et al. Designing multiple simultaneous seeds for DNA similarity search, 2004, J. Comput. Biol.

[35] Gregory Kucherov, et al. Improved hit criteria for DNA local alignment, 2004, BMC Bioinformatics.

[36] Olivier Poch, et al. BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations, 2001, Nucleic Acids Res.

[37] W. J. Kent, et al. BLAT--the BLAST-like alignment tool, 2002, Genome Research.

[38] Bin Ma, et al. On spaced seeds for similarity search, 2004, Discret. Appl. Math.