Paper Information - Efficient Index for Weighted Sequences

Efficient Index for Weighted Sequences

Finding the factors of a text string that are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in building an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet, a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of the probabilities of its letters at consecutive positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of the probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time, improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence; both improve upon the state of the art by the same factor.
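The matching condition in the abstract can be illustrated with a brute-force checker. This is only a minimal sketch of the definition, not the index described in the paper: it rescans the text for every query and takes time proportional to $n \cdot |P|$. The function name `occurrences`, the list-of-dicts representation of the weighted text, the 0-based positions, and the use of exact `Fraction` arithmetic are all assumptions made for illustration.

```python
from fractions import Fraction


def occurrences(weighted_text, pattern, z):
    """Return every 0-based position i at which `pattern` matches.

    `weighted_text` is assumed to be a list with one dict per position,
    mapping each letter to its probability at that position (letters that
    are absent from a dict have probability 0). A match at i means the
    product of the probabilities of pattern[j] at position i + j is at
    least 1/z.
    """
    threshold = Fraction(1, z)
    n, m = len(weighted_text), len(pattern)
    result = []
    for i in range(n - m + 1):
        prob = Fraction(1)
        for j, letter in enumerate(pattern):
            prob *= weighted_text[i + j].get(letter, Fraction(0))
            # Probabilities are at most 1, so the product never grows:
            # once it drops below the threshold it cannot recover.
            if prob < threshold:
                break
        else:
            # The inner loop finished without a break, so prob >= 1/z.
            result.append(i)
    return result


# A small hypothetical weighted text over the alphabet {a, b}.
text = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"b": Fraction(1)},
]

print(occurrences(text, "ab", 4))
print(occurrences(text, "aab", 8))
```

Exact fractions are used so that a product equal to the threshold, such as $1 \cdot 1/4$ against $1/z = 1/4$, is not lost to floating-point rounding. An index of the kind the abstract describes would replace the per-query rescan with a precomputed structure.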

Solon P. Pissis | Jakub Radoszewski | Tomasz Kociumaka | Carl Barton
