A DNA based Approach to find Closed Repetitive Gapped Subsequences from a Sequence Database

In bioinformatics, discovering transcription factor binding affinities is an important task, typically carried out through sequence analysis of microarray data. Accurately determining continuous and gapped motifs in a long sequence of data, such as genetic data, is challenging and requires detailed study. In this paper, we propose a new DNA algorithmic approach that finds short continuous, short gapped, long continuous, and long gapped motifs, and that also reports when a motif does not occur (negative existence). The approach determines continuous and gapped motifs accurately and in parallel, with optimum time. The proposed algorithm first uses DNA operations to generate a modified Position Weight Matrix for the searched motif pattern; this matrix records the positions at which the pattern appears in the given database. The matrix is then used to search for continuous and gapped subsequences. The proposed algorithm can be used to search genetic, scientific, and commercial databases. Implementation results confirm the correctness of the algorithm. Finally, the validity of the algorithm is checked and its complexity is analyzed.

General Terms: Sequence Mining, Pattern Recognition, Data Mining.
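The two-step idea in the abstract (record where the motif's symbols appear, then search that record for continuous and gapped occurrences) can be illustrated with a minimal sketch in conventional software. This is not the authors' DNA-operation procedure or their modified Position Weight Matrix: the function names `build_position_table` and `find_occurrences`, the `max_gap` parameter, and the sample sequence are all assumptions made for illustration, and a plain position table stands in for the matrix.

```python
from collections import defaultdict


def build_position_table(sequence, motif):
    """Hypothetical stand-in for the position matrix: map each distinct
    motif symbol to the sorted list of positions where it occurs."""
    wanted = set(motif)
    table = defaultdict(list)
    for i, symbol in enumerate(sequence):
        if symbol in wanted:
            table[symbol].append(i)
    return dict(table)


def find_occurrences(table, motif, max_gap=0):
    """Return one tuple of positions per occurrence of the motif.

    max_gap is an assumed parameter: the largest number of symbols that
    may be skipped between two consecutive matched positions.
    max_gap=0 yields only continuous occurrences.
    """
    results = []

    def extend(k, chosen):
        if k == len(motif):
            results.append(tuple(chosen))
            return
        for p in table.get(motif[k], []):
            # The next position must follow the previous one and lie
            # within the allowed gap.
            if not chosen or 0 < p - chosen[-1] <= max_gap + 1:
                extend(k + 1, chosen + [p])

    extend(0, [])
    return results


if __name__ == "__main__":
    seq = "ACGTACGGT"  # made-up sample sequence
    table = build_position_table(seq, "ACG")
    print(find_occurrences(table, "ACG"))             # continuous only
    print(find_occurrences(table, "ACG", max_gap=1))  # gaps of one symbol allowed
    print(find_occurrences(build_position_table(seq, "TTT"), "TTT"))  # empty list: motif absent
```

An empty result list corresponds to what the abstract calls negative existence. The sketch enumerates every occurrence sequentially; the parallelism claimed for the DNA approach is not modeled here.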
