An Improved Fast Search Method Using Histogram Features for DNA Sequence Database

In this paper, we propose an efficient hierarchical DNA sequence search method that improves search speed while keeping accuracy constant. For a given query DNA sequence, a fast local search method using histogram features is first applied as a filtering mechanism before the sequences in the database are scanned. Overlapping processing is newly added to improve the robustness of the algorithm. The filter excludes a large number of low-similarity DNA sequences from later searching. The Smith-Waterman algorithm is then applied to each remaining sequence. Experimental results on GenBank sequence data show that the proposed method, which combines histogram information with the Smith-Waterman algorithm, is more efficient for DNA sequence search.

Keywords: Fast search, DNA sequence, Histogram feature, Smith-Waterman algorithm, Local search
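The two-stage idea can be illustrated with a minimal Python sketch. The abstract does not specify the histogram features, the similarity measure, or the overlap scheme, so everything concrete here is an assumption made for illustration: k-mer counts as the histogram, normalized histogram intersection as the similarity, query-length windows that overlap by half, a threshold of 0.8, and a linear gap penalty in the Smith-Waterman scoring. The function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmer_histogram(seq, k=2):
    """Count overlapping k-mers in seq (assumed histogram feature)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def histogram_similarity(h1, h2):
    """Histogram intersection, normalized by the total count of h1."""
    total = sum(h1.values())
    if total == 0:
        return 0.0
    return sum(min(c, h2[kmer]) for kmer, c in h1.items()) / total


def passes_filter(query, target, k=2, threshold=0.8, step=None):
    """Keep target if any overlapping query-length window resembles the query."""
    w = len(query)
    step = step or max(1, w // 2)  # assumed: windows overlap by half
    q_hist = kmer_histogram(query, k)
    if len(target) <= w:
        return histogram_similarity(q_hist, kmer_histogram(target, k)) >= threshold
    last = len(target) - w
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # make sure the tail of the target is covered
    return any(
        histogram_similarity(q_hist, kmer_histogram(target[s:s + w], k)) >= threshold
        for s in starts
    )


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (linear gap penalty, score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def hierarchical_search(query, database, **filter_kwargs):
    """Filter by histogram, then score the survivors with Smith-Waterman."""
    hits = [
        (name, smith_waterman_score(query, seq))
        for name, seq in database.items()
        if passes_filter(query, seq, **filter_kwargs)
    ]
    return sorted(hits, key=lambda hit: -hit[1])


database = {"a": "TTTTACGTACGTTTTT", "b": "GGGGGGGGGGGGGGGG"}
print(hierarchical_search("ACGTACGT", database))
```

In this toy database, sequence `b` shares no 2-mers with the query and is dropped by the filter, so the quadratic-time alignment runs only on sequence `a`. The speed-up in a scheme like this comes from the filter being linear in the sequence length, while the threshold and window step trade speed against the risk of discarding a true match.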
