Linear-time computation of prefix table for weighted strings & applications

The prefix table of a string is one of the most fundamental data structures of algorithms on strings: it determines the longest factor at each position of the string that matches a prefix of the string. It can be computed in time linear with respect to the size of the string, and hence it can be used efficiently for locating patterns or for regularity searching in strings. A weighted string is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices or uncertain strings, naturally arise in many biological contexts; for example, they provide a method to realise approximation among occurrences of the same DNA segment. In this article, given a weighted string x of length n and a constant cumulative weight threshold 1/z, defined as the minimal probability of occurrence of factors in x, we present an O(n)-time algorithm for computing the prefix table of x. Furthermore, we outline a number of applications of this result for solving various problems on non-standard strings, and present some preliminary experimental results.
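To make the two central notions concrete, here is a minimal Python sketch. It is not the article's algorithm for weighted strings: `prefix_table` is the well-known linear-time computation for an ordinary (non-weighted) string, and `occurrence_probability` assumes the usual convention that the probability of a factor occurring at a position is the product of the per-position letter probabilities. The function names and the list-of-dictionaries representation of a weighted string are assumptions made for illustration only.

```python
from typing import Dict, List


def prefix_table(s: str) -> List[int]:
    """Prefix table of an ordinary string: pref[i] is the length of the
    longest factor starting at position i that is also a prefix of s."""
    n = len(s)
    if n == 0:
        return []
    pref = [0] * n
    pref[0] = n
    # s[left:right] is the prefix match found so far that extends furthest right.
    left, right = 0, 0
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position i - left.
            pref[i] = min(pref[i - left], right - i)
        # Extend the match by explicit letter comparisons.
        while i + pref[i] < n and s[pref[i]] == s[i + pref[i]]:
            pref[i] += 1
        if i + pref[i] > right:
            left, right = i, i + pref[i]
    return pref


def occurrence_probability(
    x: List[Dict[str, float]], factor: str, start: int
) -> float:
    """Probability that `factor` occurs at position `start` of the weighted
    string x, assuming it is the product of the letter probabilities
    (illustrative representation: one dict of letter -> probability per position)."""
    if start < 0 or start + len(factor) > len(x):
        return 0.0
    p = 1.0
    for offset, letter in enumerate(factor):
        p *= x[start + offset].get(letter, 0.0)
    return p


if __name__ == "__main__":
    print(prefix_table("aabaaab"))
    # A hypothetical weighted string of length 3 over the DNA alphabet.
    x = [{"a": 0.5, "c": 0.5}, {"a": 1.0}, {"a": 0.25, "t": 0.75}]
    z = 4
    p = occurrence_probability(x, "aat", 0)
    print(p, p >= 1 / z)
```

Under this convention, a factor counts as occurring at a position only if its occurrence probability is at least the threshold 1/z; with z = 4, the factor `aat` in the example meets the threshold at position 0.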
