Hardness of optimal spaced seed design

Speeding up approximate pattern matching has been a line of research in stringology since the 1980s. The approaches that are fast in practice belong to the class of filtration algorithms, in which text regions dissimilar to the pattern are excluded (filtered out) in a first step, and the remaining regions are compared to the pattern by dynamic programming in a second step. Among the necessary conditions used to test similarity between the regions and the pattern, many require a minimum number of common substrings between them. When only substitutions are taken into account for measuring dissimilarity, it was shown recently that counting spaced subwords instead of substrings improves the filtration efficiency. However, a preprocessing step is required to design one or more patterns, called gapped seeds, for the subwords, depending on the search parameters. The seed design problems proposed up to now differ in the way the similarities to detect are given: either a set of similarities is given in extenso (this is a “region specific” problem), or one wishes to detect all similar regions having at most k substitutions (the general detection problem). Several articles exhibit exponential algorithms for these problems. In this work, we provide hardness and inapproximability results for both the region specific and general seed design problems, thereby justifying the exponential complexity of known algorithms. Moreover, we introduce a new formulation of the region specific seed design problem, in which the weight of the seed (i.e., the number of characters in the subwords) has to be maximized, and show that it is as difficult to approximate as Maximum Independent Set.
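To make the notions of seed, weight, and detection concrete, the following minimal Python sketch may help. It is illustrative only and is not taken from the paper: the notation (a seed written with `#` for a compared position and `-` for an ignored one, a similarity written as a string of `1` for a match and `0` for a substitution) and the function names `weight`, `hits`, and `detects_all` are assumptions chosen for this example. The brute-force check enumerates every placement of k substitutions, so it only illustrates the general detection condition on small inputs.

```python
from itertools import combinations


def weight(seed: str) -> int:
    """Number of compared positions ('#') in the seed (assumed notation)."""
    return seed.count("#")


def hits(seed: str, similarity: str) -> bool:
    """True if the seed fits at some offset of the similarity.

    The seed fits at an offset when every '#' position of the seed
    faces a match ('1') in the similarity; '-' positions are ignored.
    """
    span = len(seed)
    for start in range(len(similarity) - span + 1):
        if all(similarity[start + j] == "1"
               for j, c in enumerate(seed) if c == "#"):
            return True
    return False


def detects_all(seed: str, m: int, k: int) -> bool:
    """Brute-force check that the seed hits every similarity of length m
    with at most k substitutions (requires k <= m).

    Testing exactly k substitutions is enough: turning a substitution
    back into a match can never destroy a hit.
    """
    for mismatches in combinations(range(m), k):
        sim = ["1"] * m
        for p in mismatches:
            sim[p] = "0"
        if not hits(seed, "".join(sim)):
            return False
    return True


if __name__ == "__main__":
    # The gapped seed tolerates the substitution at position 2,
    # whereas the contiguous seed of the same weight does not.
    print(hits("##-#", "1101"), hits("###", "1101"))
    print(detects_all("###", 6, 1), detects_all("###", 5, 1))
```

The number of substitution placements examined by `detects_all` is the binomial coefficient C(m, k), and a search over candidate seeds would multiply that by the number of seeds considered, which gives a feel for why exhaustive approaches to seed design become expensive quickly.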
