Coded Trace Reconstruction

Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of coded trace reconstruction: the design and analysis of high-rate, efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called traces) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics, and have no provable robustness or performance guarantees even for an error model with i.i.d. deletions and constant deletion probability. Our work is the first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d. deletions, and analyze marker-based code constructions. This gives rise to codes with redundancy $O(n/\log n)$ (resp. $O(n/\log \log n)$) that can be efficiently reconstructed from $\exp(O(\log^{2/3} n))$ (resp. $\exp(O(\log \log n)^{2/3})$) traces, where $n$ is the message length. Then, we give a construction of a code with $O(\log n)$ bits of redundancy that can be efficiently reconstructed from $\text{poly}(n)$ traces if the deletion probability is small enough. Finally, we show how to combine both approaches, giving rise to an efficient code with $O(n/\log n)$ bits of redundancy that can be reconstructed from $\text{poly}(\log n)$ traces for a small constant deletion probability.
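
The error model in the abstract (i.i.d. deletions with a constant deletion probability, observed through several independent traces) can be illustrated with a short simulation. This is a minimal sketch and not part of the paper's constructions: the function names `deletion_channel`, `draw_traces`, and `is_subsequence`, the example codeword, and the deletion probability 0.2 are all assumptions chosen for illustration.

```python
import random


def deletion_channel(codeword, p, rng):
    """Return one trace: each symbol is deleted independently with probability p."""
    return [symbol for symbol in codeword if rng.random() >= p]


def draw_traces(codeword, p, num_traces, seed=0):
    """Pass the same codeword through the deletion channel num_traces times."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [deletion_channel(codeword, p, rng) for _ in range(num_traces)]


def is_subsequence(trace, codeword):
    """Check that trace can be obtained from codeword by deletions only."""
    remaining = iter(codeword)
    return all(any(x == y for y in remaining) for x in trace)


# Hypothetical 8-bit codeword and deletion probability, for illustration only.
codeword = [1, 0, 1, 1, 0, 0, 1, 0]
traces = draw_traces(codeword, p=0.2, num_traces=5)
for trace in traces:
    print(trace, is_subsequence(trace, codeword))
```

Each trace is a subsequence of the codeword with the deleted positions unknown to the decoder; the reconstruction task is to recover the codeword from such traces alone.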
