A generalized sequence pattern matching algorithm using complementary dual-seeding

In this work, we define generalized (sequence) patterns, which are motivated by several real biological problems, including the binding of transcription factors (TFs) to transcription factor binding sites (TFBSs), cis-regulatory modules, protein domain analysis, and alternative splicing. Simply put, a generalized pattern is composed of several substrings, with a gap between each pair of adjacent substrings. We propose a generalized pattern matching algorithm that uses a complementary dual-seeding strategy, which remains sensitive in the presence of errors (both mismatches and indels). We also develop a generalized pattern matching tool, which is, to our knowledge, the first developed specifically for generalized pattern matching. Rather than replacing existing general-purpose matching tools such as BLAST, BLAT, and PatternHunter, our tool provides an alternative that helps users solve real problems, especially those that can be modeled as generalized patterns. In experiments on data randomly sampled from reference sequences of the human genome (NCBI build v18), the tool hits 98.74% of the generalized patterns on average. It runs on both Linux and Windows, and its peak memory usage is only slightly above 1 GB.
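To make the definition of a generalized pattern concrete, the following minimal sketch represents a pattern as a list of substrings plus a minimum and maximum gap length between each adjacent pair. It is an illustration only: the bounded-gap representation, the function names `compile_generalized_pattern` and `find_exact`, and the example sequences are assumptions, and the code performs exact matching with a regular expression rather than the error-tolerant complementary dual-seeding algorithm described above.

```python
import re
from typing import List, Tuple


def compile_generalized_pattern(
    substrings: List[str], gaps: List[Tuple[int, int]]
) -> "re.Pattern[str]":
    """Build a regex for substrings separated by bounded gaps.

    gaps[i] = (min_len, max_len) is the assumed gap between
    substrings[i] and substrings[i + 1] (hypothetical representation).
    """
    if not substrings or len(gaps) != len(substrings) - 1:
        raise ValueError("need exactly one gap between each adjacent pair")
    parts = [re.escape(substrings[0])]
    for (lo, hi), sub in zip(gaps, substrings[1:]):
        parts.append(f".{{{lo},{hi}}}")  # gap: any lo..hi characters
        parts.append(re.escape(sub))
    return re.compile("".join(parts))


def find_exact(
    text: str, substrings: List[str], gaps: List[Tuple[int, int]]
) -> List[int]:
    """Return every start position where the pattern occurs exactly.

    A lookahead is used so that overlapping occurrences are all reported.
    No mismatches or indels are tolerated in this sketch.
    """
    pattern = compile_generalized_pattern(substrings, gaps)
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [m.start() for m in lookahead.finditer(text)]


if __name__ == "__main__":
    # Two substrings, ACG and GGC, separated by a gap of 2 to 4 bases.
    print(find_exact("TTACGTAAGGCTTT", ["ACG", "GGC"], [(2, 4)]))
```

An exact matcher like this misses any occurrence that contains a mismatch or an indel inside a substring, which is the case the seeding strategy in the abstract is meant to handle.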
