Indexing Weighted Sequences: Neat and Efficient

[cs.DS] 25 Apr 2017

In a weighted sequence, for every position of the sequence and every letter of the alphabet, a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known as Position-Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m matches a weighted sequence X at starting position i if the product of the probabilities of the letters of P at positions i, ..., i+m-1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n over an integer alphabet that answers pattern matching queries in optimal O(m + Occ) time, where Occ is the number of occurrences reported. Our new index is based on a non-trivial construction of a family of ⌊z⌋ weighted sequences of an especially simple form that are equivalent to a general weighted sequence. This new combinatorial insight allowed us to obtain: a construction of the index in the case of a constant-sized alphabet with the same complexities as in (Barton et al., CPM 2016) but with a simple implementation; a deterministic construction in the case of a general integer alphabet (the construction of Barton et al. in this case was randomised); an improvement of the space complexity from O(nz^2) to O(nz) of a more general index for weighted sequences that was presented in (Biswas et al., EDBT 2016); and a significant improvement of the complexities of the approximate variant of the index of Biswas et al.
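The matching condition defined above can be illustrated with a minimal sketch. It is a naive check of the definition by direct scanning, not the index described in the article. The representation of a weighted sequence as a list of per-position dictionaries, the function names `match_probability` and `occurrences`, and the use of 0-based positions are all assumptions made for illustration.

```python
from fractions import Fraction

def match_probability(X, P, i):
    """Product of the probabilities of the letters of P at positions i..i+m-1 of X.

    X is assumed to be a list of dicts mapping letters to probabilities;
    a letter missing from a dict is treated as having probability 0.
    """
    prob = Fraction(1)
    for j, letter in enumerate(P):
        prob *= X[i + j].get(letter, Fraction(0))
    return prob

def occurrences(X, P, z):
    """Naive O(nm) scan: 0-based starting positions where P matches X
    with probability at least 1/z."""
    threshold = Fraction(1) / Fraction(z)
    m = len(P)
    return [i for i in range(len(X) - m + 1)
            if match_probability(X, P, i) >= threshold]

# A hypothetical weighted sequence of length 4 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(1, 4), "b": Fraction(3, 4)},
    {"b": Fraction(1)},
]

print(occurrences(X, "ab", 2))  # threshold 1/2
print(occurrences(X, "ab", 4))  # threshold 1/4
```

Exact rational arithmetic is used here so that the comparison with 1/z is not affected by floating-point rounding; an index such as the one in the article would answer the same queries without rescanning X for each pattern.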

[1] Gonzalo Navarro, et al. Top-k document retrieval in optimal time and linear space, 2012, SODA.

[2] Solon P. Pissis, et al. Pattern Matching and Consensus Problems on Weighted Sequences and Profiles, 2016, Theory of Computing Systems.

[3] Tsvi Kopelowitz, et al. Property matching and weighted matching, 2006, Theor. Comput. Sci.

[4] Maxime Crochemore, et al. Algorithms on strings, 2007.

[5] S. Muthukrishnan, et al. Efficient algorithms for document retrieval problems, 2002, SODA '02.

[6] Sharma V. Thankachan, et al. Probabilistic Threshold Indexing for Uncertain Strings, 2015, EDBT.

[7] Robert E. Tarjan, et al. A Linear-Time Algorithm for a Special Case of Disjoint Set Union, 1985, J. Comput. Syst. Sci.

[8] S. Muthukrishnan, et al. Perfect Hashing for Strings: Formalization and Algorithms, 1996, CPM.

[9] Philip S. Yu, et al. A Survey of Uncertain Data Algorithms and Applications, 2009, IEEE Transactions on Knowledge and Data Engineering.

[10] Solon P. Pissis, et al. Efficient Index for Weighted Sequences, 2016, CPM.

[11] Christopher Ré, et al. Probabilistic databases: diamonds in the dirt, 2009, CACM.

[12] Sanguthevar Rajasekaran, et al. The Efficient Computation of Position-Specific Match Scores with the Fast Fourier Transform, 2002, J. Comput. Biol.

[13] Solon P. Pissis, et al. Fast Average-Case Pattern Matching on Weighted Sequences, 2015, Int. J. Found. Comput. Sci.

[14] Solon P. Pissis, et al. On-Line Pattern Matching on Uncertain Sequences and Applications, 2016, COCOA.

[15] Moshe Lewenstein, et al. Weighted Ancestors in Suffix Trees, 2014, ESA.

[16] M. Farach. Optimal suffix tree construction with large alphabets, 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[17] Costas S. Iliopoulos, et al. Pattern Matching on Weighted Sequences, 2004.

[18] Costas S. Iliopoulos, et al. The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications, 2006, Fundam. Informaticae.

[19] Gad M. Landau, et al. Dynamic text and static pattern matching, 2007, TALG.