Approximate Trace Reconstruction

In the usual trace reconstruction problem, the goal is to exactly reconstruct an unknown string of length $n$ after it passes through a deletion channel many times independently, producing a set of traces (i.e., random subsequences of the string). We consider the relaxed problem of approximate reconstruction. Here, the goal is to output a string that is close to the original one in edit distance while using far fewer traces than are needed for exact reconstruction. We present several algorithms that approximately reconstruct strings belonging to certain classes, where the estimate is within $n/\mathrm{polylog}(n)$ edit distance of the original and where we use only $\mathrm{polylog}(n)$ traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and that are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. To complement our algorithms, we present a general black-box lower bound for approximate reconstruction, building on a lower bound for distinguishing between two candidate input strings in the worst case. In particular, this shows that approximating to within $n^{1/3 - \delta}$ edit distance requires $n^{1 + 3\delta/2}/\mathrm{polylog}(n)$ traces for $0 < \delta < 1/3$ in the worst case.
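To make the setting concrete, the following minimal Python sketch simulates a deletion channel and measures edit distance, the two ingredients of the problem statement above. It is an illustration only, not the paper's algorithm: the function names, the deletion probability `0.3`, the example string, and the number of traces are all assumptions chosen for the example.

```python
import random


def deletion_channel(s, q, rng):
    """Return one trace of s: each character is deleted independently
    with probability q (q is an assumed parameter of this sketch)."""
    return "".join(c for c in s if rng.random() >= q)


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program,
    using O(len(a) * len(b)) time and O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical input: a string made of a few long runs.
rng = random.Random(0)
x = "0" * 20 + "1" * 20 + "0" * 20
traces = [deletion_channel(x, 0.3, rng) for _ in range(5)]

for t in traces:
    # A trace is a subsequence of x, so its edit distance to x
    # equals the number of deleted characters.
    print(len(t), edit_distance(x, t))
```

An approximate reconstruction algorithm would take `traces` as input and output an estimate whose `edit_distance` to `x` is small. A single trace is already an estimate, but its distance to `x` is about `q * len(x)`, which is linear in $n$; the guarantee described above is the stronger $n/\mathrm{polylog}(n)$.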
