On approximating string selection problems with outliers

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems in which some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of strings of equal length and a parameter d, to find a string x that maximizes the number of "non-outliers", that is, strings of S within Hamming distance d of x. We prove that this problem has no polynomial-time approximation scheme (PTAS) unless NP has randomized polynomial-time algorithms, correcting a decade-old erroneous proof from the literature. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of the input strings such that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe that Closest to k Strings has no efficient PTAS (EPTAS) unless a parameterized complexity hierarchy collapses. In sum, outliers help model problems that arise when using biological data, but we show that finding an approximate solution is computationally difficult.
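
To make the two objectives concrete, the following is a minimal Python sketch, not an algorithm from the paper. The function names (`hamming`, `close_count`, `bad_columns`, `close_to_most_strings_bruteforce`) and the sample data are hypothetical, chosen only for illustration. The brute-force search enumerates every candidate string over the alphabet, so it takes exponential time and is useful only on tiny instances.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def close_count(x, strings, d):
    """Close to Most Strings objective: strings within Hamming distance d of x."""
    return sum(hamming(x, s) <= d for s in strings)


def bad_columns(strings):
    """Most Strings with Few Bad Columns: positions where the strings are not all identical."""
    return sum(len(set(column)) > 1 for column in zip(*strings))


def close_to_most_strings_bruteforce(strings, d, alphabet):
    """Exhaustive search over all candidate strings (illustrative assumption, exponential time)."""
    length = len(strings[0])
    best_x, best = None, -1
    for candidate in product(alphabet, repeat=length):
        x = "".join(candidate)
        count = close_count(x, strings, d)
        if count > best:
            best_x, best = x, count
    return best_x, best


if __name__ == "__main__":
    # Hypothetical input: three similar strings and one outlier.
    S = ["AAAA", "AAAT", "AATA", "TTTT"]
    print(close_to_most_strings_bruteforce(S, d=1, alphabet="AT"))
    print(bad_columns(S[:3]))
```

In this toy instance no center can be within distance 1 of both "AAAA" and "TTTT", so any optimal solution treats at least one input string as an outlier.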
