Indexing Weighted Sequences: Neat and Efficient (Aug 2017)

In a weighted sequence, for every position of the sequence and every letter of the alphabet, a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known as Position-Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, ..., i+m−1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n over a constant-sized alphabet that answers pattern matching queries in optimal O(m + Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ special strings that carries the information about all the strings that occur in the weighted sequence with sufficient probability. We obtain a weighted index with the same complexities as the most efficient previously known index, by Barton et al. [3], but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree, which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. [6], as well as an improvement in the space complexity of their general index.
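To make the definition of an occurrence concrete, the following is a minimal sketch of a naive matcher that checks every position directly. It is not the index described above; it runs in O(nm) time and only illustrates the threshold condition. The function name `occurrences`, the representation of X as a list of letter-to-probability dictionaries, and the use of 0-based positions are assumptions made for this illustration.

```python
from fractions import Fraction


def occurrences(X, P, z):
    """Return the 0-based positions i at which P occurs in X.

    X is a list of dicts mapping letters to probabilities (hypothetical
    representation); P occurs at i if the product of the probabilities of
    its letters at positions i, ..., i+m-1 is at least 1/z.
    """
    threshold = 1 / Fraction(z)
    m = len(P)
    result = []
    for i in range(len(X) - m + 1):
        prob = Fraction(1)
        for j, letter in enumerate(P):
            # A letter missing from a position has probability 0 there.
            prob *= X[i + j].get(letter, 0)
            # Probabilities are at most 1, so the product never grows back.
            if prob < threshold:
                break
        else:
            result.append(i)
    return result


# Toy weighted sequence of length 3 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(1, 4), "b": Fraction(3, 4)},
]
print(occurrences(X, "aa", 4))
```

Exact rational arithmetic via `Fraction` is used so that products equal to the threshold 1/z are not rejected because of floating-point rounding.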

[1] Tsvi Kopelowitz, et al. Property matching and weighted matching, 2006, Theor. Comput. Sci.

[2] Sanguthevar Rajasekaran, et al. The Efficient Computation of Position-Specific Match Scores with the Fast Fourier Transform, 2002, J. Comput. Biol.

[3] Solon P. Pissis, et al. Fast Average-Case Pattern Matching on Weighted Sequences, 2015, Int. J. Found. Comput. Sci.

[4] Philip S. Yu, et al. A Survey of Uncertain Data Algorithms and Applications, 2009, IEEE Transactions on Knowledge and Data Engineering.

[5] Solon P. Pissis, et al. Efficient Index for Weighted Sequences, 2016, CPM.

[6] M. Crochemore, et al. On-line construction of suffix trees, 2002.

[7] Tetsuo Shibuya. Constructing the Suffix Tree of a Tree with a Large Alphabet, 1999, ISAAC.

[8] Costas S. Iliopoulos, et al. Pattern Matching on Weighted Sequences, 2004.

[9] Solon P. Pissis, et al. Pattern Matching and Consensus Problems on Weighted Sequences and Profiles, 2018, Theory of Computing Systems.

[10] M. Farach. Optimal suffix tree construction with large alphabets, 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[11] Y. L. Wang, et al. Errata for "Faster index for property matching", 2009, Inf. Process. Lett.

[12] Costas S. Iliopoulos, et al. The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications, 2006, Fundam. Informaticae.

[13] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[14] Lucas Chi Kwong Hui, et al. Color Set Size Problem with Application to String Matching, 1992, CPM.

[15] Costas S. Iliopoulos, et al. Faster index for property matching, 2008, Inf. Process. Lett.

[16] Solon P. Pissis, et al. On-Line Pattern Matching on Uncertain Sequences and Applications, 2016, COCOA.

[17] Gonzalo Navarro, et al. Top-k document retrieval in optimal time and linear space, 2012, SODA.

[18] Sharma V. Thankachan, et al. Probabilistic Threshold Indexing for Uncertain Strings, 2015, EDBT.

[19] Tsvi Kopelowitz. The Property Suffix Tree with Dynamic Properties, 2010, CPM.

[20] Christopher Ré, et al. Probabilistic databases: diamonds in the dirt, 2009, CACM.

[21] S. Muthukrishnan, et al. Efficient algorithms for document retrieval problems, 2002, SODA '02.