Indexing Factors with Gaps

Abstract. Indexing of factors, or substrings, is a widely used technique in stringology and serves as a tool for solving diverse text-algorithmic problems. A gapped-factor is a concatenation of a factor of length k, a gap of length d, and another factor of length k′; such a gapped-factor is called a (k−d−k′)-gapped-factor. The problem of indexing gapped-factors was considered recently by Peterlongo et al. (In: Stringology, pp. 182–196, 2006), who devised a data structure, the gapped factor tree (GFT), to index them. Given a text $\mathcal{T}$ of length n over the alphabet Σ and the values of the parameters k, d and k′, the construction of the GFT requires O(n|Σ|) time. Once the GFT is constructed, the occurrences of a given (k−d−k′)-gapped-factor can be reported in O(k+k′+Occ) time, where Occ is the number of occurrences of that factor in $\mathcal{T}$. In this paper, we present a new, improved indexing scheme for gapped-factors. The improvements come from two aspects. Firstly, we generalize the indexing data structure in the sense that, unlike the GFT, it is independent of the parameters k and k′. Secondly, our data structure can be constructed in O(n log^{1+ε} n) time and space, where 0
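To make the definition of a (k−d−k′)-gapped-factor concrete, the following minimal sketch enumerates every gapped-factor of a text and stores its start positions in a dictionary. It is only a naive illustration under stated assumptions: it is neither the GFT of Peterlongo et al. nor the data structure proposed in this paper, it does not achieve the bounds quoted above, and the names `build_naive_gapped_index` and `report` are hypothetical.

```python
from collections import defaultdict


def build_naive_gapped_index(text, k, d, k_prime):
    """Map each (left factor, right factor) pair to its start positions.

    A (k-d-k')-gapped-factor starting at position i consists of
    text[i : i+k], a gap of d arbitrary characters, and
    text[i+k+d : i+k+d+k'].
    Illustrative only: this dictionary is not the GFT.
    """
    index = defaultdict(list)
    span = k + d + k_prime  # total length covered by one gapped-factor
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + d:i + span]
        index[(left, right)].append(i)
    return index


def report(index, left, right):
    """Return all start positions of the gapped-factor (left, gap, right)."""
    return index.get((left, right), [])


text = "acgtacgtacca"
index = build_naive_gapped_index(text, k=2, d=1, k_prime=2)
print(report(index, "ac", "ta"))  # [0, 4]
```

Here the query asks for the factor "ac", followed by one arbitrary character, followed by "ta". Unlike the scheme described in the abstract, this sketch fixes k, d and k′ at construction time.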
