Adaptive learning of compressible strings

Suppose an oracle knows a string S that is unknown to us and that we want to determine. The oracle can answer queries of the form “Is s a substring of S?”. In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle σn/4 − O(n) queries to reconstruct the hidden string, where σ is the size of the alphabet of S and n is its length. They also gave an algorithm that spends (σ − 1)n + O(σ√n) queries to reconstruct S. The main contribution of our paper is to improve this upper bound when the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to τ bits, performs q = O(τ) substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length n over an integer alphabet of size σ with rle runs can be reconstructed with q = O(rle(σ + log(n/rle))) substring queries in linear time and space. We then present an algorithm that spends q ∈ O(σg log n) substring queries and runs in O(n(log n + log σ) + q) time using linear space, where g is the size of a smallest straight-line program generating the string.
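
To make the query model concrete, the following Python sketch simulates a substring oracle and reconstructs the hidden string with a naive greedy strategy: extend the current candidate to the right while some character keeps it a substring, then extend it to the left. This is not the algorithm of Skiena and Sundaram nor any algorithm from this paper; it is only an illustration that uses at most σ(n + 2) queries. The names `make_oracle`, `reconstruct`, and `count_runs` are hypothetical helpers introduced here, and `count_runs` merely computes the number of runs (the rle measure) of a string.

```python
from itertools import groupby


def make_oracle(hidden):
    """Simulate the oracle: query(s) answers 'Is s a substring of hidden?'."""
    stats = {"queries": 0}

    def query(s):
        stats["queries"] += 1
        return s in hidden

    return query, stats


def reconstruct(query, alphabet):
    """Naive reconstruction (illustrative assumption, not the paper's method).

    Phase 1 extends the candidate to the right until no character fits,
    at which point the candidate occurs only as a suffix of the hidden string.
    Phase 2 extends it to the left until no character fits, at which point
    the candidate is the whole hidden string.
    """
    current = ""
    for extend in (lambda c, w: w + c, lambda c, w: c + w):
        grown = True
        while grown:
            grown = False
            for c in alphabet:
                candidate = extend(c, current)
                if query(candidate):
                    current = candidate
                    grown = True
                    break
    return current


def count_runs(s):
    """Number of maximal runs of equal characters (the rle measure)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    hidden = "aaabbbbcaa"
    query, stats = make_oracle(hidden)
    result = reconstruct(query, "abc")
    print(result == hidden, stats["queries"], count_runs(hidden))
```

The greedy strategy spends up to σ queries per recovered character regardless of how repetitive the string is, which is the kind of cost the compressibility-sensitive bounds in the abstract aim to beat.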
