Distance-preserving vector space embedding for the closest string problem

The closest string problem is a core problem in computational biology, with applications in other fields such as coding theory. Many algorithms exist to solve this problem, but because it is NP-hard, it can only be solved efficiently by restricting the search space to a specific range of parameters. Often, the run-time of these algorithms is exponential in the maximum distance between strings, which restricts them to very small distances. Recently, a prototype embedding method has been proposed to solve the similar generalized median problem for arbitrary objects. In this approach, objects are transformed into a vector space using prototype embedding. The problem is solved in the vector space, and the solution is then transformed back into the original space. This method has been successfully applied to generalized median computation in several domains where the computational complexity is inherently high. In this work, we apply prototype embedding to the closest string problem. We show that different embedding methods can result in a very good and fast approximation of the closest string, independent of the maximum distance and other parameters.
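The embed, solve, and map-back pipeline described above can be illustrated with a minimal sketch. It is not the method evaluated in this work, only an illustration under several assumptions: strings have equal length and are compared with the Hamming distance, the input strings themselves serve as prototypes, the vector-space problem is taken to be an approximate minimum enclosing ball center (computed with the Badoiu-Clarkson iteration), and the inverse transformation is replaced by a simple stand-in that picks, from the inputs plus their column-wise majority string, the candidate whose embedding lies nearest to that center. All function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def embed(s, prototypes):
    """Prototype embedding: one coordinate per prototype (distance to it)."""
    return [float(hamming(s, p)) for p in prototypes]


def radius(candidate, strings):
    """Closest string objective: largest distance to any input string."""
    return max(hamming(candidate, s) for s in strings)


def euclid(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def approx_ball_center(points, iterations=200):
    """Badoiu-Clarkson iteration: step toward the farthest point each round."""
    center = list(points[0])
    for i in range(1, iterations + 1):
        far = max(points, key=lambda p: euclid(center, p))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    return center


def majority_string(strings):
    """Column-wise most frequent symbol (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


def closest_string_sketch(strings, prototypes=None):
    """Embed, solve in vector space, then map back (assumed stand-in step)."""
    prototypes = prototypes or strings
    vectors = [embed(s, prototypes) for s in strings]
    center = approx_ball_center(vectors)
    # Assumption: restrict the inverse mapping to a small candidate set.
    candidates = list(strings) + [majority_string(strings)]
    return min(candidates, key=lambda c: euclid(embed(c, prototypes), center))


if __name__ == "__main__":
    data = ["ACGT", "ACGA", "TCGT", "ACCT"]
    result = closest_string_sketch(data)
    print(result, radius(result, data))
```

The run-time of this sketch depends on the number of strings, prototypes, and iterations rather than on the maximum distance between strings, which is the property the approach relies on. A more faithful inverse transformation would construct a new string rather than select from a fixed candidate set.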
