The gapped-factor tree

We present a data structure for indexing a specific kind of factor (that is, substring) called a gapped-factor. A gapped-factor is a factor containing a gap that is ignored during indexing. The data structure is based on the suffix tree and indexes all the gapped-factors of a text for a fixed gap size, and only those. It is constructed online in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
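To make the notion of a gapped-factor concrete, the following is a minimal sketch, not the suffix-tree-based structure described above. It enumerates every gapped-factor of a text naively and stores the occurrences in a dictionary. The function name `gapped_factor_index` and the parameters `k`, `d`, and `k2` (length of the left part, fixed gap size, length of the right part) are assumptions introduced for illustration only. The naive approach uses time and space proportional to the text length times the factor length, whereas the gapped-factor tree is built online in linear time and space.

```python
from collections import defaultdict


def gapped_factor_index(text, k, d, k2):
    """Naively index all gapped-factors of `text`.

    A gapped-factor is modelled here (as an assumption) as a left part of
    length k, a gap of fixed size d that is ignored, and a right part of
    length k2. The result maps each (left, right) pair to the sorted list
    of start positions at which it occurs.
    """
    index = defaultdict(list)
    span = k + d + k2  # total window length, gap included
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + d:i + span]  # the d gap characters are skipped
        index[(left, right)].append(i)
    return dict(index)


index = gapped_factor_index("abcabcab", k=2, d=1, k2=2)
print(index[("ab", "ab")])  # start positions of "ab", any one character, "ab"
```

Because the gap content is not part of the key, two occurrences whose gap characters differ are still indexed under the same entry, which is what makes this kind of index useful for filtration.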
