Quadratic-backtracking algorithm for string reconstruction from substring compositions

Motivated by the problem of deducing the structure of proteins using mass spectrometry, we study the reconstruction of a string from the multiset of its substring compositions. We specialize to string reconstruction the backtracking algorithm used for the more general turnpike problem. Employing well-known results about the transience of random walks in ≥ 3 dimensions, we show that the algorithm reconstructs random strings over an alphabet of size ≥ 4 with high probability in near-optimal quadratic time.
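
To make the input to the reconstruction problem concrete, the following minimal sketch computes the multiset of substring compositions of a string, where the composition of a substring records how many times each symbol occurs in it but not the order. The function name `substring_compositions` and the choice to represent a composition as a frozenset of (symbol, count) pairs are illustrative assumptions, not taken from the paper, and this is not the backtracking algorithm itself.

```python
from collections import Counter


def substring_compositions(s):
    """Return the multiset of compositions of all contiguous substrings of s.

    Hypothetical helper for illustration: each composition is stored as a
    frozenset of (symbol, count) pairs so that it can be used as a key.
    """
    multiset = Counter()
    for i in range(len(s)):
        running = Counter()  # symbol counts of the substring s[i..j]
        for j in range(i, len(s)):
            running[s[j]] += 1
            multiset[frozenset(running.items())] += 1
    return multiset


if __name__ == "__main__":
    forward = substring_compositions("ABCA")
    backward = substring_compositions("ACBA")
    # A string and its reversal always yield the same multiset, so
    # reconstruction can only be expected up to reversal.
    print(forward == backward)
    # A string of length n has n(n+1)/2 contiguous substrings.
    print(sum(forward.values()))
```

Since a string of length n has n(n+1)/2 substrings, merely reading the input takes quadratic time, which is the sense in which a quadratic running time is near-optimal.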
