Duplication Distance to the Root for Binary Sequences

We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $x = abc \to y = abbc$, where $x$ and $y$ are sequences and $a$, $b$, and $c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set {0, 1, 01, 10, 010, 101}. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=\Theta(n)$.
For the case of approximate duplication, where a $\beta$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $\beta = 1/2$. We also study the duplication distance to the root for the set of sequences arising from a given root and for special classes of sequences, namely, the De Bruijn sequences, the Thue–Morse sequence, and the Fibonacci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.
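
To make the exact-duplication setting concrete, the following is a minimal brute-force sketch, not the method used in the paper. It reverses the operation $abc \to abbc$: it searches breadth-first over tandem deduplications $abbc \to abc$ until a square-free sequence is reached, and the number of steps taken is the duplication distance to the root. The function names `deduplications`, `distance_to_root`, and `max_distance` are hypothetical, and the search is exponential in general, so it is only practical for short sequences.

```python
from collections import deque
from itertools import product


def deduplications(s):
    """Yield every string obtained from s by one tandem deduplication abbc -> abc."""
    n = len(s)
    for length in range(1, n // 2 + 1):          # length of the repeated block b
        for i in range(n - 2 * length + 1):      # start position of the first copy of b
            if s[i:i + length] == s[i + length:i + 2 * length]:
                yield s[:i + length] + s[i + 2 * length:]


def distance_to_root(s):
    """Return (root, d): the first square-free sequence reached from s and the
    minimum number of deduplications needed to reach it (breadth-first search)."""
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        current, steps = queue.popleft()
        shorter = list(deduplications(current))
        if not shorter:                          # no square left: current is square-free
            return current, steps
        for t in shorter:
            if t not in seen:
                seen.add(t)
                queue.append((t, steps + 1))


def max_distance(n):
    """Brute-force f(n): the largest distance to the root over all binary sequences of length n."""
    return max(distance_to_root("".join(bits))[1] for bits in product("01", repeat=n))


if __name__ == "__main__":
    for word in ["0110", "0000", "0101"]:
        print(word, distance_to_root(word))
```

Breadth-first search is chosen here only because it returns a minimum step count without further argument; it says nothing about how $f(n)$ grows for large $n$, which is what the result $f(n)=\Theta(n)$ addresses.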
