Counting Distinct Patterns in Internal Dictionary Matching

We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ so as to efficiently answer queries $CountDistinct(i,j)$: given $i$ and $j$, return the number of distinct patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. The dictionary therefore takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than to their total length, which could be $\Theta(n\cdot d)$. The work that introduced internal dictionary matching [ISAAC 2019] proposed an $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $\mathcal{O}(\log n)$-approximately in $\tilde{\mathcal{O}}(1)$ time. Here we present an $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $2$-approximately in $\tilde{\mathcal{O}}(1)$ time. Using range queries, for any $m$, we give an $\tilde{\mathcal{O}}(\min(nd/m,n^2/m^2)+d)$-size data structure that answers $CountDistinct(i,j)$ queries exactly in $\tilde{\mathcal{O}}(m)$ time. We also consider the special case in which the dictionary consists of all square factors of the text. For this case we design an $\mathcal{O}(n \log^2 n)$-size data structure that counts the distinct squares in a text fragment $T[i \mathinner{.\,.} j]$ in $\mathcal{O}(\log n)$ time.
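To make the query semantics concrete, the following is a minimal brute-force sketch in Python. It is not any of the data structures described above, only a naive reference for checking answers on small inputs. The function names, the 0-based inclusive indexing, and the representation of the internal dictionary as a list of `(start, end)` fragments of the text are assumptions made for illustration.

```python
def count_distinct(text, dictionary, i, j):
    """Naive CountDistinct(i, j): number of distinct dictionary patterns
    occurring in text[i..j] (0-based, inclusive; an assumed convention).

    `dictionary` is a list of (a, b) pairs, each denoting the fragment
    text[a..b], which mirrors the "internal" representation.
    """
    fragment = text[i:j + 1]
    # Different fragments of the text may spell the same pattern,
    # so deduplicate by string value before counting.
    patterns = {text[a:b + 1] for a, b in dictionary}
    return sum(1 for p in patterns if p in fragment)


def count_distinct_squares(text, i, j):
    """Naive count of distinct squares (strings XX with X nonempty)
    that occur in text[i..j]."""
    fragment = text[i:j + 1]
    squares = set()
    for start in range(len(fragment)):
        max_half = (len(fragment) - start) // 2
        for half in range(1, max_half + 1):
            left = fragment[start:start + half]
            right = fragment[start + half:start + 2 * half]
            if left == right:
                squares.add(left + right)
    return len(squares)


if __name__ == "__main__":
    text = "abaabaab"
    # Fragments (0,1) and (3,4) both spell "ab"; (2,3) is "aa"; (1,3) is "baa".
    dictionary = [(0, 1), (3, 4), (2, 3), (1, 3)]
    print(count_distinct(text, dictionary, 0, 2))  # fragment "aba" -> 1
    print(count_distinct(text, dictionary, 0, 7))  # whole text -> 3
    print(count_distinct_squares(text, 0, 7))      # aa, abaaba, baabaa, aabaab -> 4
```

Each `count_distinct` call scans the fragment once per distinct pattern, and `count_distinct_squares` takes cubic time in the fragment length in the worst case. Both are useful only as a correctness baseline against which the approximate and exact structures could be tested.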

[1]  Wojciech Rytter,et al.  Internal Dictionary Matching , 2019, ISAAC.

[2]  Tomasz Kociumaka Efficient data structures for internal queries in texts , 2019 .

[3]  Tomasz Kociumaka Minimal Suffix and Rotation of a Substring in Optimal Time , 2016, CPM.

[4]  Mikkel Thorup Space efficient dynamic stabbing with fast queries , 2003, STOC '03.

[5]  Gad M. Landau,et al.  Dynamic text and static pattern matching , 2007, TALG.

[6]  Frantisek Franek,et al.  How many double squares can a string contain? , 2015, Discret. Appl. Math.

[7]  Wojciech Rytter,et al.  Internal Pattern Matching Queries in a Text and Applications , 2013, SODA.

[8]  Gregory Kucherov,et al.  Finding maximal repetitions in a word in linear time , 1999, 40th Annual Symposium on Foundations of Computer Science.

[9]  Dan E. Willard,et al.  Log-logarithmic worst-case range queries are possible in space Θ(N) , 1983 .

[10]  Moshe Lewenstein,et al.  Generalized substring compression , 2009, Theor. Comput. Sci..

[11]  Milan Ruzic,et al.  Constructing Efficient Dictionaries in Close to Sorting Time , 2008, ICALP.

[12]  Monika Henzinger,et al.  Unifying and Strengthening Hardness for Dynamic Problems via the Online Matrix-Vector Multiplication Conjecture , 2015, STOC.

[13]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[14]  Jens Stoye,et al.  Linear time algorithms for finding and representing all the tandem repeats in a string , 2004, J. Comput. Syst. Sci..

[15]  Timothy M. Chan,et al.  Counting inversions, offline orthogonal range counting, and related problems , 2010, SODA '10.

[16]  Maxime Crochemore,et al.  An Optimal Algorithm for Computing the Repetitions in a Word , 1981, Inf. Process. Lett..

[17]  Jens Stoye,et al.  Simple and flexible detection of contiguous repeats using a suffix tree , 2002, Theor. Comput. Sci..

[18]  Jamie Simpson,et al.  The total run length of a word , 2013, Theor. Comput. Sci..

[19]  Aviezri S. Fraenkel,et al.  How Many Squares Can a String Contain? , 1998, J. Comb. Theory, Ser. A.

[20]  H. Wilf,et al.  Uniqueness theorems for periodic functions , 1965 .

[21]  Haim Kaplan,et al.  Efficient Colored Orthogonal Range Counting , 2008, SIAM J. Comput..

[22]  Wojciech Rytter,et al.  Extracting powers and periods in a word from its runs structure , 2014, Theor. Comput. Sci..

[23]  Arseny M. Shur,et al.  Counting Palindromes in Substrings , 2017, SPIRE.

[24]  Timothy M. Chan,et al.  Dynamic Orthogonal Range Searching on the RAM, Revisited , 2017, SoCG.

[25]  Kazuya Tsuruta,et al.  The "Runs" Theorem , 2014, SIAM J. Comput..