Applying Agrep to r-NSA to solve multiple sequences approximate matching

This paper addresses the approximate matching problem in a database consisting of multiple DNA sequences. The proposed approach applies Agrep to a new truncated suffix array, r-NSA. The construction time of the structure is linear in the database size, and indexing a substring in the structure takes constant time. The number of characters processed when applying Agrep is analysed theoretically, and the theoretical upper bound closely approximates the empirical number of characters, which is obtained by enumerating the characters in the structure actually built. Experiments are carried out using synthetic random DNA sequences as well as real genome sequences, including Hepatitis B virus and the X chromosome. Experimental results show that, compared to the straightforward approach that applies Agrep to each sequence individually, the proposed approach solves the matching problem in much shorter time. The speed-up depends on the sequence patterns; for highly similar homologous genome sequences, which are the common case in real-life genomes, it can reach several orders of magnitude.
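For readers unfamiliar with Agrep, its core is the bit-parallel (Shift-And) algorithm of Wu and Manber. The sketch below illustrates that general technique on a single sequence, which roughly corresponds to one scan of the straightforward per-sequence baseline mentioned above. It is not the r-NSA structure and not the authors' implementation. The function name `agrep_end_positions`, the choice of edit (Levenshtein) distance, and the convention of reporting match end positions are assumptions made for illustration only.

```python
def agrep_end_positions(pattern, text, k):
    """Return the 0-based end indices in `text` where `pattern` matches
    some substring with at most `k` edits (insert, delete, substitute).

    Illustrative Wu-Manber style Shift-And scan; hypothetical helper,
    not taken from the paper.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1          # keeps state vectors to m bits
    accept = 1 << (m - 1)        # bit set when the whole pattern is matched

    # masks[c] has bit j set when pattern[j] == c
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    # R[d] bit j set: pattern[0..j] matches a suffix of the text read so
    # far with at most d errors. A prefix of length d can always be
    # "matched" by d deletions, hence the d low bits start set.
    R = [((1 << d) - 1) & full for d in range(k + 1)]

    ends = []
    for i, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = R[0]                       # R[d-1] before this character
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur_old = R[d]
            R[d] = ((((cur_old << 1) | 1) & mask)   # exact character match
                    | prev_old                      # extra character in text
                    | (prev_old << 1)               # substitution
                    | (R[d - 1] << 1)               # character missing in text
                    | 1) & full
            prev_old = cur_old
        if R[k] & accept:
            ends.append(i)
    return ends


if __name__ == "__main__":
    # "ACCT" differs from "ACGT" by one substitution.
    print(agrep_end_positions("ACGT", "ACCT", 1))
```

Each text character costs O(k) word operations when the pattern fits in a machine word, so scanning every sequence in a database separately costs time proportional to the total database length per query. That repeated scanning is the cost the abstract says is reduced by running Agrep over the shared r-NSA structure instead.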
