Coded Trace Reconstruction

Motivated by average-case trace reconstruction and coding for portable DNA-based storage systems, we initiate the study of coded trace reconstruction: the design and analysis of high-rate, efficiently encodable codes that can be efficiently decoded with high probability from few reads (also called traces) corrupted by edit errors. Codes used in current portable DNA-based storage systems with nanopore sequencers are largely based on heuristics and have no provable robustness or performance guarantees, even for an error model with i.i.d. deletions and constant deletion probability. Our work is a first step towards the design of efficient codes with provable guarantees for such systems. We consider a constant rate of i.i.d. deletions, and begin by analyzing marker-based code constructions coupled with worst-case trace reconstruction algorithms. Then, we show how a more careful design of the code allows us to exploit ideas from average-case trace reconstruction to reduce the number of traces required with the same redundancy.
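
The error model considered here, i.i.d. deletions with a constant deletion probability, can be illustrated with a minimal simulation sketch. The function names, the parameter `p`, the seed, and the example string are assumptions made for illustration and do not come from the paper; the sketch only simulates the channel that produces traces and does not implement any of the codes or reconstruction algorithms.

```python
import random


def deletion_channel(codeword: str, p: float, rng: random.Random) -> str:
    """Return one trace: each symbol is deleted independently with probability p."""
    return "".join(sym for sym in codeword if rng.random() >= p)


def sample_traces(codeword: str, p: float, num_traces: int, seed: int = 0) -> list[str]:
    """Pass the same codeword through the channel num_traces times, independently."""
    rng = random.Random(seed)  # fixed seed only for reproducibility (assumption)
    return [deletion_channel(codeword, p, rng) for _ in range(num_traces)]


def is_subsequence(trace: str, codeword: str) -> bool:
    """Check that trace can be obtained from codeword by deletions alone."""
    it = iter(codeword)
    return all(sym in it for sym in trace)


if __name__ == "__main__":
    x = "ACGTTGCAACGT"  # hypothetical stored sequence
    for t in sample_traces(x, p=0.2, num_traces=5):
        print(t, is_subsequence(t, x))
```

Every trace is a subsequence of the stored sequence, and a decoder in this setting has to recover the sequence from a small number of such traces.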
