Optimizing Spaced $k$-mer Neighbors for Efficient Filtration in Protein Similarity Search

Large-scale comparison and similarity search of genomic DNA and protein sequences are of fundamental importance in modern molecular biology. To perform DNA and protein sequence similarity search efficiently, seeding (or filtration) methods have been widely used, in which only sequences sharing a common pattern, or “seed”, are subject to detailed comparison. These methods therefore trade search sensitivity for search speed. In this paper, we introduce a new seeding method, called spaced k-mer neighbors, which provides a better tradeoff between sensitivity and speed in protein sequence similarity search. In this method, a set of spaced k-mers is selected as the neighbors of each spaced k-mer. These pre-selected neighbors are then used to detect hits between the query sequence and the database sequences. We propose an efficient heuristic algorithm for spaced neighbor selection. Our computational experiments demonstrate that the method of spaced k-mer neighbors can improve the overall tradeoff efficiency over existing seeding methods.
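
As a rough illustration of the filtration idea described above, the following Python sketch extracts spaced k-mers from a query and a database sequence, builds a neighbor set for each query k-mer, and reports a hit wherever a database k-mer falls in that set. The seed pattern `1101`, the toy amino-acid groups, the scoring values, and the score-threshold rule used to pick neighbors are all assumptions made for illustration. The paper's own heuristic neighbor-selection algorithm is not reproduced here; a simple threshold rule stands in for it.

```python
from collections import defaultdict
from itertools import product

# Hypothetical similarity classes (an assumption for this sketch);
# a real tool would score residues with a substitution matrix.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}
ALPHABET = sorted(GROUP_OF)  # the 20 standard amino acids


def score(a, b):
    """Toy residue score: identical = 2, same group = 1, otherwise -1."""
    if a == b:
        return 2
    return 1 if GROUP_OF[a] == GROUP_OF[b] else -1


def spaced_kmers(seq, pattern):
    """Yield (position, spaced k-mer); '1' in pattern marks a sampled column."""
    offsets = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield start, "".join(seq[start + o] for o in offsets)


def neighbors(kmer, threshold):
    """All k-mers whose toy score against `kmer` reaches `threshold`.

    Stand-in for neighbor selection: exhaustive enumeration, so only
    practical for small k.
    """
    return {
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(kmer))
        if sum(score(a, b) for a, b in zip(kmer, cand)) >= threshold
    }


def find_hits(query, database, pattern, threshold):
    """Return sorted (query_pos, db_pos) pairs whose spaced k-mers are neighbors."""
    index = defaultdict(list)  # database spaced k-mer -> positions
    for dpos, kmer in spaced_kmers(database, pattern):
        index[kmer].append(dpos)

    cache = {}  # neighbor sets are computed once per distinct query k-mer
    hits = []
    for qpos, kmer in spaced_kmers(query, pattern):
        if kmer not in cache:
            cache[kmer] = neighbors(kmer, threshold)
        for nb in cache[kmer]:
            for dpos in index.get(nb, ()):
                hits.append((qpos, dpos))
    return sorted(hits)


if __name__ == "__main__":
    print(find_hits("MKVLA", "MRILA", "1101", 4))
```

In this sketch, raising `threshold` shrinks each neighbor set, which means fewer hits to examine (faster) but more true similarities missed (less sensitive). That is the sensitivity and speed tradeoff the abstract refers to.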
