On-line weighted pattern matching

Abstract A weighted sequence is a sequence of probability distributions over an alphabet of size $\sigma$. Weighted sequences arise naturally in many applications. We study the problem of weighted pattern matching, in which we are given a string pattern $P$ of length $m$, a weight threshold $\frac{1}{z}$, and a weighted text $X$ arriving on-line. We say that $P$ occurs in $X$ at position $i$ if the product of the probabilities of the letters of $P$ at positions $i-m+1, \ldots, i$ in $X$ is at least $\frac{1}{z}$. We first discuss how to apply a known general scheme that transforms off-line pattern matching algorithms into on-line algorithms. This yields an on-line algorithm that requires $O((\sigma + \log z) \log m)$ or $O(\sigma \log^2 m)$ time per arriving position; the space requirement, however, is $O(m \min(\sigma, z))$. Our main result is a new algorithm that processes each arriving position of $X$ in $O(z + \sigma)$ time using $O(m + z)$ extra space.
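The occurrence condition above can be illustrated with a naive baseline. The sketch below is not the algorithm of the paper: it simply buffers the last $m$ distributions and recomputes the product at every arriving position, so it takes $O(m)$ time per position rather than $O(z + \sigma)$. The function name `naive_online_occurrences`, the representation of each text position as a dictionary from letters to probabilities, and the use of 0-based positions are all assumptions made for illustration.

```python
from collections import deque
from fractions import Fraction


def naive_online_occurrences(pattern, z, stream):
    """Yield every 0-based end position i at which `pattern` occurs in the
    weighted text `stream` with probability at least 1/z.

    `stream` is any iterable of dicts mapping letters to probabilities
    (an assumed encoding of a weighted sequence); letters missing from a
    dict are treated as having probability 0.
    """
    m = len(pattern)
    window = deque(maxlen=m)      # the last m arriving distributions
    threshold = Fraction(1, z)    # exact arithmetic avoids rounding issues

    for i, dist in enumerate(stream):
        window.append(dist)
        if len(window) < m:
            continue              # not enough text yet for an occurrence
        prob = Fraction(1)
        for letter, d in zip(pattern, window):
            prob *= Fraction(d.get(letter, 0))
            if prob < threshold:  # the product can only shrink further
                break
        else:
            yield i               # product stayed at or above 1/z


# Example: a weighted text of length 3 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(1, 4), "b": Fraction(3, 4)},
]
print(list(naive_online_occurrences("aa", 4, X)))
```

Because every probability is at most 1, the running product never increases, which is why the inner loop may stop as soon as it drops below the threshold.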
