Better Filtering with Gapped q-Grams

A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e, approximate string matching with the Hamming distance.

[1]  Juha Kärkkäinen Computing the Threshold for q-Gram Filters , 2002, SWAT.

[2]  Stefan Burkhardt Filter algorithms for approximate string matching , 2002 .

[3]  Tadao Takaoka,et al.  Approximate Pattern Matching with Samples , 1994, ISAAC.

[4]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[5]  Esko Ukkonen,et al.  Two Algorithms for Approximate String Matching in Static Texts , 1991, MFCS.

[6]  Esko Ukkonen,et al.  Algorithms for Approximate String Matching , 1985, Inf. Control..

[7]  Eugene W. Myers,et al.  A fast bit-vector algorithm for approximate string matching based on dynamic programming , 1998, JACM.

[8]  Jeremy Buhler,et al.  Designing seeds for similarity search in genomic DNA , 2003, RECOMB '03.

[9]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[10]  Ricardo A. Baeza-Yates,et al.  A Practical q -Gram Index for Text Retrieval Allowing Errors , 2018, CLEI Electron. J..

[11]  Alan M. Frieze,et al.  On the power of universal bases in sequencing by hybridization , 1999, RECOMB.

[12]  Isidore Rigoutsos,et al.  FLASH: a fast look-up algorithm for string homology , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Esko Ukkonen,et al.  Approximate String-Matching over Suffix Trees , 1993, CPM.

[14]  Martin Vingron,et al.  q-gram based database searching using a suffix array (QUASAR) , 1999, RECOMB.

[15]  Juha Kärkkäinen,et al.  One-Gapped q-Gram Filtersfor Levenshtein Distance , 2002, CPM.

[16]  T. Ideker,et al.  Mining SNPs from EST databases. , 1999, Genome research.

[17]  Gad M. Landau,et al.  Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching , 2001 .

[18]  J. Weber,et al.  Human whole-genome shotgun sequencing. , 1997, Genome research.

[19]  Thomas G. Marr,et al.  Approximate String Matching and Local Similarity , 1994, CPM.

[20]  Archie L. Cobbs,et al.  Fast Approximate Matching using Suffix Trees , 1995, CPM.

[21]  Erkki Sutinen,et al.  Filtration with q-Samples in Approximate String Matching , 1996, CPM.

[22]  Eugene W. Myers,et al.  Basic local alignment search tool. Journal of Molecular Biology , 1990 .

[23]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[24]  Erkki Sutinen,et al.  Indexing text with approximate q-grams , 2000, J. Discrete Algorithms.

[25]  Eugene W. Myers,et al.  A sublinear algorithm for approximate keyword searching , 1994, Algorithmica.

[26]  Esko Ukkonen,et al.  Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[27]  Erkki Sutinen,et al.  On Using q-Gram Locations in Approximate String Matching , 1995, ESA.

[28]  Martin Vingron,et al.  A set-theoretic approach to database searching and clustering , 1998, Bioinform..

[29]  Eli Upfal,et al.  Sequencing-by-Hybridization at the Information-Theory Bound: An Optimal Algorithm , 2000, J. Comput. Biol..

[30]  Pavel A. Pevzner,et al.  Multiple filtration and approximate pattern matching , 1995, Algorithmica.