Streaming k-Mismatch with Error Correcting and Applications

Abstract: We present a new streaming algorithm for the k-Mismatch problem, one of the most basic problems in pattern matching. Given a pattern and a text, the task is to find all substrings of the text that are at Hamming distance at most k from the pattern. Our algorithm is enhanced with an important new feature called Error Correcting, and its complexities for k = 1 and for general k are comparable to those of the solutions for the k-Mismatch problem by Porat and Porat (FOCS 2009) and Clifford et al. (SODA 2016). In parallel to our research, a yet more efficient algorithm for the k-Mismatch problem with the Error Correcting feature was developed by Clifford et al. (SODA 2019). Using the new feature and recent work on streaming Multiple Pattern Matching, we develop a series of streaming algorithms for pattern matching on weighted strings, which are a commonly used representation of uncertain sequences in molecular biology. We also show that these algorithms are space-optimal up to polylog factors. A preliminary version of this work was published at the DCC 2017 conference [28].
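To make the problem statement concrete, the following is a minimal offline sketch of the k-Mismatch task as defined in the abstract. It is not the streaming algorithm of this work and does not implement Error Correcting: it keeps the whole text in memory and runs in time proportional to the product of the text and pattern lengths, so it only serves as a reference for what a correct output looks like. The function names `hamming_capped` and `k_mismatch_occurrences` are hypothetical and chosen for illustration.

```python
from typing import List


def hamming_capped(a: str, b: str, k: int) -> int:
    """Count mismatches between equal-length strings a and b.

    Counting stops as soon as the number of mismatches exceeds k,
    so the return value is either the exact distance (if <= k) or k + 1.
    """
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                break
    return mismatches


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Return all 0-based start positions i such that text[i:i+m]
    is at Hamming distance at most k from pattern (m = len(pattern)).

    Naive baseline (assumption: the full text fits in memory);
    a streaming algorithm would instead process the text one
    character at a time using sublinear space.
    """
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming_capped(pattern, text[i:i + m], k) <= k
    ]


if __name__ == "__main__":
    # Windows "abc", "abd", and "xbc" each differ from "abc" in at most one position.
    print(k_mismatch_occurrences("abc", "abcabdxbc", 1))
```

With k = 0 the same function reduces to exact pattern matching, which is a convenient sanity check when comparing against a more elaborate implementation.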

[1] Solon P. Pissis, et al. Efficient Index for Weighted Sequences, 2016, CPM.

[2] Tsvi Kopelowitz, et al. Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams, 2018, ICALP.

[3] Solon P. Pissis, et al. Linear-time computation of prefix table for weighted strings & applications, 2016, Theor. Comput. Sci.

[4] Costas S. Iliopoulos, et al. Approximate Matching in Weighted Sequences, 2006, CPM.

[5] Funda Ergün, et al. Periodicity in Streams, 2010, APPROX-RANDOM.

[6] Costas S. Iliopoulos, et al. Pattern Matching on Weighted Sequences, 2004.

[7] Sharma V. Thankachan, et al. Probabilistic Threshold Indexing for Uncertain Strings, 2015, EDBT.

[8] Maxime Crochemore, et al. Algorithms on strings, 2007.

[9] Ely Porat, et al. Dictionary Matching in a Stream, 2015, ESA.

[10] Ely Porat, et al. The streaming k-mismatch problem, 2019, SODA.

[11] Solon P. Pissis, et al. Indexing Weighted Sequences: Neat and Efficient, 2020, Inf. Comput.

[12] Raphaël Clifford, et al. Approximate Hamming Distance in a Stream, 2016, ICALP.

[13] Ely Porat, et al. Exact and Approximate Pattern Matching in the Streaming Model, 2009, FOCS.

[14] Ely Porat, et al. The k-mismatch problem revisited, 2016, SODA.

[15] Ely Porat, et al. Improved Sketching of Hamming Distance with Error Correcting, 2007, CPM.

[16] Solon P. Pissis, et al. Linear-Time Computation of Prefix Table for Weighted Strings, 2015, WORDS.

[17] H. Wilf, et al. Uniqueness theorems for periodic functions, 1965.

[18] Wyeth W. Wasserman, et al. JASPAR: an open-access database for eukaryotic transcription factor binding profiles, 2004, Nucleic Acids Res.

[19] Richard M. Karp, et al. Efficient Randomized Pattern-Matching Algorithms, 1987, IBM J. Res. Dev.

[20] Solon P. Pissis, et al. Pattern Matching and Consensus Problems on Weighted Sequences and Profiles, 2016, Theory of Computing Systems.

[21] Costas S. Iliopoulos, et al. The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications, 2006, Fundam. Informaticae.

[22] Esko Ukkonen, et al. Fast profile matching algorithms - A survey, 2008, Theor. Comput. Sci.

[23] Zvi Galil, et al. Real-Time Streaming String-Matching, 2014, TALG.

[24] Andrew Chi-Chih Yao, et al. Probabilistic computations: Toward a unified measure of complexity, 1977, FOCS.

[25] Ely Porat, et al. A black box for online approximate pattern matching, 2008, Inf. Comput.

[26] Ely Porat, et al. Real-Time Streaming Multi-Pattern Search for Constant Alphabet, 2017, ESA.

[27] Xuhua Xia, et al. Position Weight Matrix, Gibbs Sampler, and the Associated Significance Tests in Motif Characterization and Prediction, 2012, Scientifica.

[28] Jakub Radoszewski, et al. Streaming K-Mismatch with Error Correcting and Applications, 2017, DCC.

[29] Ely Porat, et al. Space lower bounds for online pattern matching, 2013, Theor. Comput. Sci.

[30] Tsvi Kopelowitz, et al. Property matching and weighted matching, 2006, Theor. Comput. Sci.