MINING SEQUENCE MOTIFS FROM PROTEIN DATABASES BASED ON A BIT PATTERN APPROACH

Proteins are the structural components of living cells and tissues, and thus an important building block of all living organisms. Sequence motifs are subsequences that appear frequently in proteins. Motifs often denote important functional regions and can be used to characterize a protein family or to discover the function of proteins. The SP-index algorithm was proposed to find sequence motifs containing gaps of arbitrary size. To find motifs, it constructs B-trees that index the positions at which short segments occur. Then, to check whether a long pattern composed of short segments appears frequently, the SP-index algorithm must test a large number of nodes in those B-trees, which may be inefficient. Therefore, in this paper, we propose the Bit-Pattern-based (BP) algorithm to improve on the efficiency of the SP-index algorithm. First, the BP algorithm transforms the protein sequences into bit patterns. Then, instead of testing a large number of B-tree nodes as the SP-index algorithm does, the BP algorithm uses bit operations, i.e., AND, OR, shifting, and masking, to find sequence motifs efficiently. The BP algorithm also performs a pruning step to reduce the processing time. Experimental results on biological and synthetic data sets show that the BP algorithm requires less processing time than the SP-index algorithm.
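The general idea of matching gapped patterns with bit operations can be illustrated with a minimal sketch. This is not the BP algorithm itself, whose encoding and pruning step are not specified in this abstract. It assumes one bitmask per residue per sequence (bit i is set when the residue occurs at position i), and the function names encode, segment_starts, occurs_with_gaps, and support are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict


def encode(seq):
    """Map each residue to an int whose bit i is set when seq[i] is that residue."""
    masks = defaultdict(int)
    for i, residue in enumerate(seq):
        masks[residue] |= 1 << i
    return masks


def segment_starts(masks, segment, length):
    """Return a bitmask of the start positions where a contiguous segment occurs."""
    hits = (1 << length) - 1  # masking: every position starts as a candidate
    for offset, residue in enumerate(segment):
        # Shifting right by `offset` aligns position i + offset with bit i,
        # so the AND keeps only starts whose offset-th residue matches.
        hits &= masks.get(residue, 0) >> offset
    return hits


def occurs_with_gaps(seq, segments):
    """True if the segments occur in order, separated by gaps of any size (>= 0)."""
    masks = encode(seq)
    n = len(seq)
    full = (1 << n) - 1
    allowed = full  # positions at which the next segment may start
    for segment in segments:
        starts = segment_starts(masks, segment, n) & allowed
        if starts == 0:
            return False
        # Lowest set bit is the earliest start; taking it greedily leaves
        # the most room for the remaining segments.
        first = (starts & -starts).bit_length() - 1
        # The next segment must start at or after the end of this one.
        allowed = full & ~((1 << (first + len(segment))) - 1)
    return True


def support(sequences, segments):
    """Count the sequences that contain the gapped pattern."""
    return sum(occurs_with_gaps(seq, segments) for seq in sequences)


if __name__ == "__main__":
    toy_sequences = ["MKVLAAGIC", "MKAAGVLIC", "GGGMKIC"]  # made-up data
    print(support(toy_sequences, ["MK", "IC"]))
    print(support(toy_sequences, ["VL", "AAG"]))
```

In this sketch a pattern is a list of short segments, and the test for each segment is a handful of integer operations rather than a traversal of an index structure. A pattern would count as a motif when its support reaches a user-chosen minimum threshold.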
