A New Approach for Similarity Queries of Biological Sequences in Databases

As biological databases grow larger, querying the biological sequences they contain effectively has become an increasingly important issue for researchers. Few systems currently provide fast access to very large biological sequences. In this paper, we propose a new approach to similarity querying of biological sequences in databases. The general idea is to first transform the biological sequences into vectors and then map them onto 2-D points in planes; these points are then indexed with a spatial index using self-organizing maps (SOM), and a single efficient similarity query with multiple simultaneous input sequences is performed using a fast algorithm, the multi-point range query (MPRQ) algorithm. This approach works well because similarity queries for multiple sequences can be answered with just one MPRQ query, which yields substantial savings in query time. We applied our method to DNA and protein sequences in databases, and the results show that our algorithm is time-efficient and that its accuracy is satisfactory.
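The pipeline described above (sequences to vectors, vectors to 2-D points, one query for several input sequences) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how sequences are encoded, so the k-mer frequency encoding in `kmer_vector` is an assumption; `TinySOM` is a hypothetical, minimal self-organizing map; and `multi_point_range_query` is a plain linear scan standing in for the MPRQ algorithm and its spatial index, which are not specified here.

```python
import math
import random
from itertools import product


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector (assumed encoding, not from the paper)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing unknown symbols
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinySOM:
    """Hypothetical minimal SOM on a width x height grid of weight vectors."""

    def __init__(self, width, height, dim, seed=0):
        rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        self.width, self.height = width, height
        self.weights = {(x, y): [rng.random() for _ in range(dim)]
                        for x in range(width) for y in range(height)}

    def best_unit(self, vec):
        """Grid coordinates of the node whose weights are closest to vec."""
        return min(self.weights, key=lambda node: sq_dist(self.weights[node], vec))

    def train(self, vectors, epochs=50, lr0=0.5):
        radius0 = max(self.width, self.height) / 2.0
        for epoch in range(epochs):
            frac = 1.0 - epoch / epochs  # decays linearly towards 0
            lr, radius = lr0 * frac, max(radius0 * frac, 0.5)
            for vec in vectors:
                bx, by = self.best_unit(vec)
                for (x, y), w in self.weights.items():
                    d2 = (x - bx) ** 2 + (y - by) ** 2
                    h = math.exp(-d2 / (2.0 * radius ** 2))  # neighborhood weight
                    for i in range(len(w)):
                        w[i] += lr * h * (vec[i] - w[i])


def multi_point_range_query(points, queries, radius):
    """Answer all query points in one pass over the stored points.

    Linear scan used as a stand-in; a real system would use a spatial index.
    """
    results = [[] for _ in queries]
    r2 = radius * radius
    for seq_id, p in points.items():
        for qi, q in enumerate(queries):
            if sq_dist(p, q) <= r2:
                results[qi].append(seq_id)
    return results


# Toy database of made-up DNA strings, for illustration only.
database = {"s1": "ACGTACGTAC", "s2": "ACGTACGTTC",
            "s3": "GGGGCCCCGG", "s4": "TTTTAAAATT"}
vectors = {sid: kmer_vector(s) for sid, s in database.items()}
som = TinySOM(4, 4, dim=16)
som.train(list(vectors.values()))
points = {sid: som.best_unit(v) for sid, v in vectors.items()}

# Two input sequences are answered by a single call.
queries = [som.best_unit(kmer_vector(q)) for q in ("ACGTACGTAC", "GGGGCCCCGG")]
hits = multi_point_range_query(points, queries, radius=1.0)
```

Here `hits[i]` lists the database identifiers whose 2-D points fall within `radius` of the point for the i-th input sequence. The design point being illustrated is that all input sequences share one pass over the stored points, rather than triggering one separate query each.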
