Shortest Unique Substring Queries on Run-Length Encoded Strings

We consider the problem of answering shortest unique substring (SUS) queries on run-length encoded strings. For a string S, a unique substring u = S[i..j] is said to be a shortest unique substring (SUS) of S containing an interval [s, t] (i ≤ s ≤ t ≤ j) if, for any i' ≤ s ≤ t ≤ j' with j-i > j'-i', S[i'..j'] occurs at least twice in S. Given a run-length encoding of size m of a string of length N, we show that we can construct a data structure of size O(m + pi_s(N, m)) in O(m log m + pi_c(N, m)) time such that queries can be answered in O(pi_q(N, m) + k) time, where k is the size of the output (the number of SUSs), and pi_s(N, m), pi_c(N, m), and pi_q(N, m) are, respectively, the size, construction time, and query time of a predecessor/successor query data structure for m elements over the universe [1, N]. Using the data structure by Beame and Fich (JCSS 2002), this results in a data structure of O(m) space that is constructed in O(m log m) time and answers queries in O(sqrt(log m / log log m) + k) time.
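To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the data structure described above: it works on the uncompressed string and takes polynomial time, and its only purpose is to show what a query on an interval [s, t] is expected to return. The function names (`run_length_encode`, `count_occurrences`, `sus_containing`) are hypothetical and chosen for illustration; positions are 1-based and inclusive, as in the abstract.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the run-length encoding of s as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def count_occurrences(s, u):
    """Count occurrences of u in s, including overlapping ones."""
    return sum(1 for p in range(len(s) - len(u) + 1) if s.startswith(u, p))


def sus_containing(s, start, end):
    """Return all (i, j) such that S[i..j] is a SUS containing [start, end].

    Brute force (illustrative assumption, not the paper's method): try
    candidate lengths in increasing order and return every unique substring
    of the first length at which at least one unique substring covers the
    interval.
    """
    n = len(s)
    for length in range(end - start + 1, n + 1):
        found = []
        lo = max(1, end - length + 1)       # leftmost start that still covers `end`
        hi = min(start, n - length + 1)     # rightmost start that covers `start` and fits in s
        for i in range(lo, hi + 1):
            j = i + length - 1
            if count_occurrences(s, s[i - 1:j]) == 1:
                found.append((i, j))
        if found:
            return found
    return []


print(run_length_encode("abba"))      # [('a', 1), ('b', 2), ('a', 1)]
print(sus_containing("abba", 2, 2))   # [(1, 2), (2, 3)]
print(sus_containing("aabab", 5, 5))  # [(3, 5)]
```

In the string "abba", the single character at position 2 occurs twice, while both "ab" = S[1..2] and "bb" = S[2..3] are unique, so the query reports two SUSs (k = 2). The result in the abstract answers such queries directly from the m runs rather than from the N characters.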

[1]  Kuan-Yu Chen, et al. Efficient retrieval of approximate palindromes in a run-length encoded string, 2012, Theor. Comput. Sci.

[2]  Hideo Bannai, et al. An Opportunistic Text Indexing Structure Based on Run Length Encoding, 2015, CIAC.

[3]  Gad M. Landau, et al. Computing Similarity of Run-Length Encoded Strings with Affine Gap Penalty, 2005, SPIRE.

[4]  Friedrich Möller, et al. Genome comparison without alignment using shortest unique substrings, 2005, BMC Bioinformatics.

[5]  János Csirik, et al. An algorithm for matching run-length coded strings, 1993, Computing.

[6]  Michael A. Bender, et al. The LCA Problem Revisited, 2000, LATIN.

[7]  Jian Pei, et al. On shortest unique substring queries, 2013, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[8]  Wing-Kai Hon, et al. An In-place Framework for Exact and Approximate Shortest Unique Substring Queries, 2015, ISAAC.

[9]  Faith Ellen, et al. Optimal Bounds for the Predecessor Problem and Related Problems, 2002, J. Comput. Syst. Sci.

[10]  Bojian Xu, et al. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries, 2015, Theor. Comput. Sci.

[11]  Zsuzsanna Lipták, et al. Binary jumbled string matching for highly run-length compressible texts, 2013, Inf. Process. Lett.

[12]  Gad M. Landau, et al. Matching for Run-Length Encoded Strings, 1999, J. Complex.

[13]  Kazuya Tsuruta, et al. Shortest Unique Substrings Queries in Optimal Time, 2014, SOFSEM.

[14]  János Csirik, et al. An Improved Algorithm for Computing the Edit Distance of Run-Length Coded Strings, 1995, Inf. Process. Lett.

[15]  Gad M. Landau, et al. Edit distance of run-length encoded strings, 2002, Inf. Process. Lett.

[16]  Gonzalo Navarro, et al. Approximate Matching of Run-Length Compressed Strings, 2001, CPM.

[17]  Jian Pei, et al. Shortest Unique Queries on Strings, 2014, SPIRE.

[18]  Gary Benson, et al. Let sleeping files lie: pattern matching in Z-compressed files, 1994, SODA '94.

[19]  Y. L. Wang, et al. A fast algorithm for finding the positions of all squares in a run-length encoded string, 2009, Theor. Comput. Sci.