The Shortest Superstring Problem

The shortest superstring problem (SSP) is a combinatorial optimization problem that has attracted the interest of many researchers because of its applications in computational molecular biology and in computer science. The SSP is NP-hard, and therefore little effort has been devoted to developing exact algorithms for it. On the other hand, several approximation and heuristic algorithms have been implemented, and they indicate that greedy strategies are highly effective for this problem. Variations of these algorithms can be parallelized, which provides the computational power needed to solve real-world instances. Polynomially solvable versions of the problem, obtained under specific restrictions on its parameters, reveal the boundary between hard and easy cases. The known bounds on the approximability of the SSP are a consequence of its Max-SNP-hardness, but the proven values are weak, which reflects the potential strength of greedy approximation techniques. The case for greedy methods is further supported by the asymptotic behaviour of the problem on random instances and by its smoothed analysis on real-world instances. All these issues are presented in this chapter concisely, covering the relevant literature, summarizing what is already known, and paving the way for further development in the study of shortest superstrings. The order of the sections follows the passage from hardness results for the SSP, to efficient algorithms for the problem based on greedy strategies, and finally to theoretical results that establish the strength of the greedy techniques.
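To make the greedy strategy mentioned above concrete, the following is a minimal sketch of the classic greedy merging heuristic: repeatedly merge the two strings with the largest suffix-prefix overlap until one string remains. It is an illustration under stated assumptions, not the chapter's own implementation; the function names `overlap` and `greedy_superstring` are hypothetical, ties are broken by scan order, and the quadratic pair scan favours clarity over efficiency.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Return a (not necessarily shortest) superstring of all inputs."""
    # Remove duplicates, then drop strings contained in another string:
    # any superstring of the container already covers them.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared overlap only once.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` merges `abc` with `bcd` and then the result with `cde`, giving `abcde`. In general the output is only guaranteed to contain every input string; it need not be optimal.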
