Determining the unique decodability of a string in linear time

Determining whether an unordered collection of overlapping substrings (called shingles) can be uniquely decoded into a consistent string is a problem that arises in a broad range of disciplines, from networking and information theory to cryptography, genetic engineering, and linguistics. We present three perspectives on this problem: a graph-theoretic framework due to Pevzner, an automata-theoretic approach from our previous work, and a new insight that yields an efficient streaming algorithm for determining whether a string of n characters over the alphabet Σ can be uniquely decoded from its two-character shingles. This online algorithm runs in Θ(n + |Σ|) time overall and uses O(|Σ|) space. As an application, we demonstrate how the algorithm can be adapted to larger, varying-size shingles for (empirically) efficient string reconciliation.
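To make the decoding question concrete, the following is a minimal brute-force sketch, not the linear-time streaming algorithm described above. It assumes that the decoder knows the first character of the string in addition to the multiset of two-character shingles; that assumption, and the names `shingles`, `decodings`, and `is_uniquely_decodable`, are ours for illustration. The search treats each shingle as a directed edge between characters and enumerates every walk that uses all edges exactly once, in the spirit of the graph-theoretic view, so it can take exponential time.

```python
from collections import Counter


def shingles(s):
    """Return the multiset of overlapping two-character shingles of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def decodings(s):
    """Return every string that starts with s[0] and has the same
    two-character shingle multiset as s (exhaustive search)."""
    remaining = shingles(s)
    found = set()

    def extend(prefix):
        # A prefix as long as s has consumed every shingle exactly once.
        if len(prefix) == len(s):
            found.add(prefix)
            return
        for gram in list(remaining):
            # Follow any unused shingle that starts at the current last character.
            if remaining[gram] > 0 and gram[0] == prefix[-1]:
                remaining[gram] -= 1
                extend(prefix + gram[1])
                remaining[gram] += 1  # backtrack

    if s:
        extend(s[0])
    return found


def is_uniquely_decodable(s):
    """True if s is the only string consistent with its shingles (assumed known start)."""
    return len(decodings(s)) <= 1


if __name__ == "__main__":
    for word in ("abc", "abaca"):
        print(word, sorted(decodings(word)), is_uniquely_decodable(word))
```

For example, "abaca" and "acaba" share the shingles ab, ba, ac, ca and the same first character, so under this assumption neither is uniquely decodable, whereas "abc" is.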