Longest k-tuple Common Sub-Strings

We study a new problem: finding a longest k-tuple of common sub-strings (abbr. k-CSS) of two or more strings. We present a suffix-tree-based algorithm that finds a longest k-CSS of m strings in $O(kmn^{k})$ time and $O(kmn)$ space, where n is the total length of the m strings. The algorithm can also approximate the longest k-CSS problem to a performance ratio of $\frac{1}{\epsilon}$ in $O(kmn^{\lceil\epsilon k\rceil})$ time for $\epsilon\in(0,1]$. Because its space complexity is linear in n, the algorithm is advantageous when comparing particularly long strings. The algorithm also shows that finding a longest gapped pattern of a non-constant number of strings is solvable in polynomial time when the number of gaps is bounded by a constant, although the problem without any restriction on the number of gaps was proved NP-hard. Using a C++ tool based on the algorithm, we performed experiments to find longest 2-CSSs, 3-CSSs and 5-CSSs of 2 to 14 COVID-19 S-proteins. With the help of longest 2-CSSs and 3-CSSs of COVID-19 S-proteins, we identified the mutation sites in the S-proteins of two COVID-19 variants, Delta and Omicron. The tool is available for download at https://github.com/lytt0/k-CSS.
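To make the problem statement concrete, the following is a minimal sketch for the special case of two strings. It is not the suffix tree algorithm described above; it is a plain dynamic program, swapped in only for illustration. It assumes that a k-CSS consists of at most k non-empty sub-strings that occur in the same left-to-right order, without overlapping, in both strings, and that its length is the sum of the sub-string lengths. That definition and the function name `longest_k_css_length` are assumptions made for this example.

```python
def longest_k_css_length(s: str, t: str, k: int) -> int:
    """Total length of a longest tuple of at most k ordered, non-overlapping
    common sub-strings of s and t (illustrative definition, see lead-in)."""
    n, m = len(s), len(t)

    # suf[a][b] = length of the longest common suffix of s[:a] and t[:b]
    suf = [[0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            if s[a - 1] == t[b - 1]:
                suf[a][b] = suf[a - 1][b - 1] + 1

    # prev[a][b] = best total length using at most j-1 blocks
    # inside the prefixes s[:a] and t[:b]; starts with j-1 = 0 blocks.
    prev = [[0] * (m + 1) for _ in range(n + 1)]
    for _ in range(k):
        cur = [[0] * (m + 1) for _ in range(n + 1)]
        for a in range(1, n + 1):
            for b in range(1, m + 1):
                # Option 1: the last character of one prefix is unused.
                best = max(cur[a - 1][b], cur[a][b - 1])
                # Option 2: a common block of the given length ends
                # exactly at position a in s and position b in t.
                for length in range(1, suf[a][b] + 1):
                    best = max(best, prev[a - length][b - length] + length)
                cur[a][b] = best
        prev = cur
    return prev[n][m]
```

For example, `longest_k_css_length("abcde", "abxde", 1)` returns 2 (a single common sub-string such as "ab"), while `longest_k_css_length("abcde", "abxde", 2)` returns 4 (the pair "ab", "de"). This sketch runs in time cubic in the string lengths for each of the k rounds and uses quadratic space, so it is suitable only for short inputs and does not reflect the time or space bounds stated above.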
