Coded Trace Reconstruction (9 Sep 2019)

Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of coded trace reconstruction: the design and analysis of high-rate, efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called traces) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics, and have no provable robustness or performance guarantees even for an error model with i.i.d. deletions and constant deletion probability. Our work is a first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d. deletions and analyze marker-based code constructions. This gives rise to codes with redundancy O(n/log n) (resp. O(n/log log n)) that can be efficiently reconstructed from exp(O(log^{2/3} n)) (resp. exp(O((log log n)^{2/3}))) traces, where n is the message length. Then, we give a construction of a code with O(log n) bits of redundancy that can be efficiently reconstructed from poly(n) traces if the deletion probability is small enough. Finally, we show how to combine both approaches, giving rise to an efficient code with O(n/log n) bits of redundancy which can be reconstructed from poly(log n) traces for a small constant deletion probability.
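To make the error model concrete, the following is a minimal sketch of the i.i.d. deletion channel that produces traces from a codeword. It only illustrates the channel, not any of the code constructions or decoders summarized above. The function names (`deletion_channel`, `sample_traces`, `is_subsequence`) and the example codeword are assumptions introduced here for illustration.

```python
import random


def deletion_channel(codeword: str, deletion_prob: float, rng: random.Random) -> str:
    """Return one trace: each symbol is deleted independently with
    probability deletion_prob (hypothetical helper, for illustration only)."""
    return "".join(symbol for symbol in codeword if rng.random() >= deletion_prob)


def sample_traces(codeword: str, deletion_prob: float, num_traces: int, seed: int = 0) -> list:
    """Draw num_traces independent traces of the same codeword."""
    rng = random.Random(seed)
    return [deletion_channel(codeword, deletion_prob, rng) for _ in range(num_traces)]


def is_subsequence(trace: str, codeword: str) -> bool:
    """Check that trace can be obtained from codeword by deletions alone."""
    remaining = iter(codeword)
    # 'symbol in remaining' consumes the iterator up to the first match,
    # so matches are forced to occur in order.
    return all(symbol in remaining for symbol in trace)


if __name__ == "__main__":
    codeword = "0110100110"  # assumed example input
    for trace in sample_traces(codeword, deletion_prob=0.2, num_traces=5):
        print(trace, is_subsequence(trace, codeword))
```

Every trace is a subsequence of the transmitted codeword, and a decoder in this setting sees only such a collection of independent traces; the number of traces it needs is the quantity the results above bound.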
