Efficient data structures for internal queries in texts

This thesis is devoted to internal queries in texts, which ask to solve classic text-processing problems for substrings of a given text. More precisely, the task is to preprocess a static string T of length n (called the text) and construct a data structure answering certain questions about the substrings of T. The substrings involved in each query are specified in constant space by their occurrences in T, called fragments of T, which are identified by their start and end positions. Components for internal queries often become parts of more complex data structures, and they are used in many algorithms for text processing.

Longest Common Extension Queries, asking for the length of the longest common prefix of two substrings of the text T, are by far the most popular internal queries. They are used for checking if two fragments match (represent the same string) and for lexicographic comparison of substrings. Due to an optimal solution in the standard setting of texts over polynomially-bounded integer alphabets, with O(1)-time queries, O(n) size, and O(n) construction time, they have found numerous applications across stringology. In this dissertation, we provide the first optimal data structure for smaller alphabets of size σ ≪ n: it handles queries in O(1) time, takes O(n / log_σ n) space, and admits an O(n / log_σ n)-time construction (from the packed representation of T with Θ(log_σ n) characters in each machine word).

We then go back to alphabets of size σ polynomial in n and focus on more complex internal queries. Our first data structure supports Internal Pattern Matching Queries, which ask for the occurrences of one substring x within another substring y. After O(n)-time preprocessing of the text T, it answers these queries in time proportional to the quotient |y|/|x| of the substrings' lengths, which is required due to the information content of the output. We also use this data structure for Period Queries, asking for the periods of a given substring. Here, our logarithmic query time is also optimal by a similar information-theoretic argument.

Further data structures are designed for Minimal Suffix and Minimal Rotation Queries, asking to compute the lexicographically smallest non-empty suffix and cyclic rotation of a given substring, respectively. They are answered in O(1) time after O(n)-time preprocessing. We also consider a more general problem of simulating the suffix array of a given substring (Substring Suffix Selection Queries, asking for the kth lexicographically smallest suffix of a substring) and its inverse suffix array (Substring Suffix Rank Queries, asking for the lexicographic rank of a substring's suffix). Our data structure supports these queries in O(log n) time, takes O(n) space, and can be constructed in O(n √(log n)) time. The tools developed in this dissertation additionally yield improved results for several kinds of Substring Compression Queries, which ask for the compressed representation of a given substring obtained using a specific method; we consider schemes based on the Lempel–Ziv parsing and the Burrows–Wheeler transform.

Our results combine text-processing tools with combinatorics on words and state-of-the-art general-purpose data structures. The key technical contribution is a novel locally consistent symmetry-breaking scheme, formalized in terms of synchronizing functions, which is central to our solutions for Longest Common Extension Queries and Internal Pattern Matching Queries.

2012 ACM Subject Classification: Theory of computation → Pattern matching
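To make the query definitions concrete, the following is a minimal Python sketch of what each query returns, written as naive reference code. It is not the data structures of the thesis: every function here runs in time at least linear in the fragment lengths and serves only to pin down the semantics. The function names, the 0-based indexing, and the representation of a fragment as a half-open pair (start, end) are assumptions made for this illustration.

```python
# Naive reference semantics for some internal queries (illustration only).
# Assumption: a fragment of `text` is a half-open pair (start, end), 0-based.

def lce(text, a, b):
    """Longest Common Extension: length of the longest common prefix
    of the fragments a and b of text."""
    (i, a_end), (j, b_end) = a, b
    k = 0
    while i + k < a_end and j + k < b_end and text[i + k] == text[j + k]:
        k += 1
    return k

def ipm(text, x, y):
    """Internal Pattern Matching: start positions (in text) of the
    occurrences of fragment x that lie inside fragment y."""
    m = x[1] - x[0]
    return [s for s in range(y[0], y[1] - m + 1)
            if lce(text, x, (s, s + m)) == m]

def periods(text, w):
    """Period query: all p with 1 <= p <= |w| such that w[i] == w[i + p]
    wherever both positions exist (so |w| itself always qualifies)."""
    start, end = w
    n = end - start
    return [p for p in range(1, n + 1)
            if lce(text, (start, end - p), (start + p, end)) == n - p]

def minimal_suffix(text, w):
    """Minimal Suffix query: start position (in text) of the
    lexicographically smallest non-empty suffix of fragment w."""
    start, end = w
    return min(range(start, end), key=lambda s: text[s:end])

text = "abaababaab"
print(lce(text, (0, 10), (3, 10)))   # compares "abaababaab" with "ababaab"
print(ipm(text, (0, 3), (0, 10)))    # occurrences of "aba" in the whole text
print(periods(text, (0, 5)))         # periods of "abaab"
print(minimal_suffix(text, (0, 5)))  # smallest suffix of "abaab"
```

Note that `periods` and `ipm` are both phrased in terms of `lce`, which mirrors the role the abstract gives Longest Common Extension Queries as a tool for checking whether two fragments match.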
