Unique Reconstruction of Coded Sequences from Multiset Substring Spectra

The problem of reconstructing strings from their substring spectra has a long history; in its simplest form, it asks under which conditions the spectrum uniquely determines the string. We study coded string reconstruction from multiset substring spectra, where the strings are restricted to lie in a codebook. In particular, we consider binary codebooks that allow for unique string reconstruction and propose a new method, termed repeat replacement, for constructing such codebooks. Our contributions include algorithmic solutions for repeat replacement and constructive redundancy bounds for the underlying coding schemes. The study is motivated by applications in DNA-based data storage systems that use high-throughput readout sequencers.
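To make the uniqueness question concrete, the following minimal Python sketch computes a multiset substring spectrum and checks by brute force which binary strings share it. It assumes the spectrum is the multiset of all length-k substrings of the string, which may differ from the exact definition used in the paper. The function names `multiset_spectrum` and `strings_with_same_spectrum` are hypothetical, and the sketch illustrates the reconstruction problem only, not the repeat replacement method.

```python
from collections import Counter
from itertools import product


def multiset_spectrum(s: str, k: int) -> Counter:
    """Multiset of all length-k substrings of s, with multiplicities.

    Assumption: this is one common reading of "multiset substring spectrum".
    """
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def strings_with_same_spectrum(s: str, k: int) -> list:
    """Brute force: every binary string of length len(s) whose length-k
    spectrum equals that of s. Exponential in len(s); for illustration only."""
    target = multiset_spectrum(s, k)
    candidates = ("".join(bits) for bits in product("01", repeat=len(s)))
    return [c for c in candidates if multiset_spectrum(c, k) == target]


# "0110" is not uniquely determined by its length-2 spectrum {01, 11, 10}:
print(strings_with_same_spectrum("0110", 2))  # ['0110', '1011', '1101']

# "0011" is uniquely determined by its length-2 spectrum {00, 01, 11}:
print(strings_with_same_spectrum("0011", 2))  # ['0011']
```

Under this reading, a codebook that allows unique reconstruction would contain only strings for which the returned list has a single element. Increasing k can also remove ambiguity: for k = 3, the string "0110" is the only match for its own spectrum.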
