Topics in combinatorial pattern matching

This dissertation studies problems in the general theme of combinatorial pattern matching. More specifically, we study the following topics:

Longest Common Extensions. We revisit the longest common extension (LCE) problem, that is, preprocess a string T into a compact data structure that supports fast LCE queries. An LCE query takes a pair (i, j) of indices in T and returns the length of the longest common prefix of the suffixes of T starting at positions i and j. Such queries are also commonly known as longest common prefix (LCP) queries. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let n be the length of T. Given a parameter τ, 1 ≤ τ ≤ n, we show how to achieve either O(n/√τ) space and O(τ) query time, or O(n/τ) space and O(τ log(|LCE(i, j)|/τ)) query time, where |LCE(i, j)| denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when τ = 1 or τ = n. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.

Fingerprints in Compressed Strings. The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. We show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that supports fingerprint queries. That is, given indices i and j, the answer to a query is the
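To make the definition of an LCE query given earlier concrete, the following is a minimal sketch that answers a query by direct character comparison, with no preprocessing. It is not one of the data structures developed in the dissertation; the function name `lce` and the use of 0-based indices are assumptions made for illustration only.

```python
def lce(t: str, i: int, j: int) -> int:
    """Return the length of the longest common prefix of t[i:] and t[j:].

    Illustrative baseline only: indices are 0-based (an assumption), and the
    query is answered by scanning, so it stores nothing beyond the string and
    takes time proportional to the length of the answer.
    """
    n = len(t)
    length = 0
    # Extend the match while both positions stay inside t and the characters agree.
    while i + length < n and j + length < n and t[i + length] == t[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    t = "abracadabra"
    # The suffixes starting at positions 0 and 7 are "abracadabra" and "abra".
    print(lce(t, 0, 7))
```

This scan keeps no auxiliary data structure and may compare up to n characters per query, so it only illustrates what a query returns; the trade-offs described in the abstract concern how much faster a query can be answered for a given amount of extra space.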
