Shortest DNA Cyclic Cover in Compressed Space

For a set of input words, finding a superstring (a string that contains each word of the set as a substring) of minimal length is NP-hard. Most approximation algorithms first solve the Shortest Cyclic Cover problem and then merge the resulting cyclic strings into a linear superstring. A cyclic cover is a set of cyclic strings such that each input word occurs as a substring of at least one of them. We investigate a variant of the Shortest Cyclic Cover problem for the case of DNA. Because the two strands that compose DNA have reverse complementary sequences, and because the sequencing process often does not record which strand a read comes from, each read or its reverse complement must occur as a substring in a cyclic cover. We exhibit a linear time, graph-based algorithm for solving the Shortest DNA Cyclic Cover problem and propose compressed data structures for storing the underlying graphs. All results and algorithms can be adapted to the case where strings are simply reversed but not complemented (e.g. in pattern recognition).
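
To make the covering condition concrete, the following Python sketch checks whether a given set of cyclic strings is a DNA cyclic cover of a set of reads. It only verifies a candidate cover; it is not the linear time algorithm of the paper and does not compute a shortest cover. The function names are hypothetical, and the sketch assumes the convention that a word occurs in a cyclic string when it is a substring of that string repeated indefinitely, which the abstract does not spell out.

```python
# Illustrative sketch only: verifies the DNA cyclic cover condition,
# it does not construct a shortest cover.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return read.translate(_COMPLEMENT)[::-1]


def occurs_in_cyclic(word: str, cyclic: str) -> bool:
    """Check whether `word` is a substring of the cyclic string `cyclic`.

    Assumption: a word occurs in a cyclic string if it is a substring of
    the string repeated indefinitely, so occurrences may wrap around.
    """
    if not cyclic:
        return False
    # Enough repetitions to cover every start position, including wrap-around.
    reps = len(word) // len(cyclic) + 2
    return word in cyclic * reps


def is_dna_cyclic_cover(reads: list[str], cover: list[str]) -> bool:
    """True if every read, or its reverse complement, occurs in some cyclic string."""
    return all(
        any(
            occurs_in_cyclic(read, c) or occurs_in_cyclic(reverse_complement(read), c)
            for c in cover
        )
        for read in reads
    )


if __name__ == "__main__":
    reads = ["ACG", "GTT"]
    # "ACG" wraps around the end of the cyclic string; "GTT" is covered
    # through its reverse complement "AAC".
    print(is_dna_cyclic_cover(reads, ["CGAACA"]))
```

For the variant mentioned at the end of the abstract, where strings are reversed but not complemented, the same check would apply with `reverse_complement` replaced by plain string reversal.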
