Mass Error-Correction Codes for Polymer-Based Data Storage

We consider the problem of correcting mass readout errors in information encoded in binary polymer strings. Our work builds on results for string reconstruction from composition multisets [1] and on the unique string reconstruction framework proposed in [2]. Binary polymer-based data storage systems [3] use two molecules of significantly different masses to represent the symbols {0,1} and are read out through noisy tandem mass spectrometry. A tandem mass spectrometer fragments the string to be read into shorter substrings and reports only their masses, often with errors due to imprecise ionization. Modeling the output of the fragmentation process as a composition multiset allows for designing asymptotically optimal codes that support unique reconstruction and the correction of a single mass error [2] through the use of derivatives of Catalan paths. Nevertheless, no solutions for correcting multiple mass errors are currently known. Our work addresses this issue by describing the first multiple-error-correcting codes, which use the polynomial factorization approach for the Turnpike problem [4] and the related factorization described in [1]. Adding Reed-Solomon type coding redundancy to the corresponding polynomials allows for correcting t mass errors in polynomial time using $\mathcal{O}(t^2 \log k)$ redundant bits, where k is the information string length. The redundancy can be improved to $\mathcal{O}(t + \log k)$; however, no decoding algorithm for this scheme that runs in time polynomial in both t and n is currently known, where n is the length of the coded string.
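As a rough illustration of the composition multiset model mentioned above, the following Python sketch computes, for a binary string, the multiset of (number of 0s, number of 1s) pairs over all contiguous substrings. It assumes an idealized, error-free readout in which each fragment mass determines its composition uniquely; the function name `composition_multiset` is hypothetical and is not taken from the cited works.

```python
from collections import Counter


def composition_multiset(s: str) -> Counter:
    """Return the multiset of (zeros, ones) compositions of all
    contiguous substrings of the binary string s.

    Illustrative assumption: an error-free mass readout reveals exactly
    this information, because the two symbol masses differ enough for a
    fragment mass to identify its composition.
    """
    counts = Counter()
    n = len(s)
    for i in range(n):
        ones = 0
        for j in range(i, n):
            # Extend the substring s[i..j] by one symbol and update its composition.
            ones += s[j] == "1"
            length = j - i + 1
            counts[(length - ones, ones)] += 1
    return counts


if __name__ == "__main__":
    s = "0010"
    # A string and its reversal always share the same composition multiset,
    # so reconstruction can be unique at best up to reversal.
    print(composition_multiset(s) == composition_multiset(s[::-1]))
    # Two strings with the same symbol counts can still be told apart.
    print(composition_multiset("0011") == composition_multiset("0101"))
```

A string of length n has n(n+1)/2 contiguous substrings, which is the total size of the multiset. In this model, a mass error corresponds to one entry of the multiset being reported incorrectly.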

[1] Alon Orlitsky, et al. String Reconstruction from Substring Compositions, 2014, SIAM J. Discret. Math.

[2] Jean-François Lutz, et al. Mass spectrometry sequencing of long digital polymers facilitated by programmed inter-byte fragmentation, 2017, Nature Communications.

[3] Olgica Milenkovic, et al. Coded Trace Reconstruction, 2019, 2019 IEEE Information Theory Workshop (ITW).

[4] Han Mao Kiah, et al. Codes for DNA sequence profiles, 2015, 2015 IEEE International Symposium on Information Theory (ISIT).

[5] Sampath Kannan, et al. Reconstructing strings from random traces, 2004, SODA '04.

[6] Robert N Grass, et al. Robust chemical preservation of digital information on DNA in silica with error-correcting codes, 2015, Angewandte Chemie.

[7] Ewan Birney, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA, 2013, Nature.

[8] Olgica Milenkovic, et al. Coding in 2D: Using Intentional Dispersity to Enhance the Information Capacity of Sequence-Coded Polymer Barcodes, 2016, Angewandte Chemie.

[9] Warren D. Smith, et al. Reconstructing Sets From Interpoint Distances, 2003.

[10] Jian Ma, et al. A Rewritable, Random-Access DNA-Based Storage System, 2015, Scientific Reports.

[11] Olgica Milenkovic, et al. DNA punch cards for storing data on native DNA sequences via enzymatic nicking, 2020, Nature Communications.

[12] Krishnamurthy Viswanathan, et al. Improved string reconstruction over insertion-deletion channels, 2008, SODA '08.

[13] Vladimir I. Levenshtein, et al. Efficient Reconstruction of Sequences from Their Subsequences or Supersequences, 2001, J. Comb. Theory A.

[14] Olgica Milenkovic, et al. Unique Reconstruction of Coded Sequences from Multiset Substring Spectra, 2018, 2018 IEEE International Symposium on Information Theory (ISIT).

[15] Olgica Milenkovic, et al. Reconstruction and Error-Correction Codes for Polymer-Based Data Storage, 2019, 2019 IEEE Information Theory Workshop (ITW).

[16] Olgica Milenkovic, et al. Portable and Error-Free DNA-Based Data Storage, 2016, Scientific Reports.

[17] Miroslav Dudík, et al. Reconstruction from subsequences, 2003, J. Comb. Theory A.

[18] Yuan Zhou. Introduction to Coding Theory, 2010.