Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWMs) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is the special case in which the two sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterized by the number of strings that match one of the sequences. Our basic approach is a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem. On the lower-bound side, we prove that our dependence on the parameter is optimal up to lower-order terms, conditioned on the optimality of the original algorithm for the knapsack problem.
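To make the lookahead scoring idea concrete, the following is a minimal sketch of the textbook form of the technique for matching a profile against an ordinary (certain) text. It is not the variation developed in this work. The function name `profile_matches`, the representation of a profile as a list of per-position score dictionaries, and the convention that an alignment is reported when its total score is at least `threshold` are all assumptions made for illustration.

```python
from typing import Dict, List


def profile_matches(profile: List[Dict[str, float]],
                    text: str,
                    threshold: float) -> List[int]:
    """Return start positions in `text` whose alignment score reaches `threshold`.

    Illustrative assumptions: profile[j][c] is the score of letter c at
    profile position j, and the score of an alignment is the sum of the
    per-position scores.
    """
    m = len(profile)

    # best_rest[j] = maximum score obtainable from profile positions j..m-1.
    best_rest = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        best_rest[j] = best_rest[j + 1] + max(profile[j].values())

    hits = []
    for i in range(len(text) - m + 1):
        score = 0.0
        for j in range(m):
            # Letters missing from a column are treated as impossible.
            score += profile[j].get(text[i + j], float("-inf"))
            # Lookahead: stop as soon as even the best possible
            # completion cannot reach the threshold.
            if score + best_rest[j + 1] < threshold:
                break
        else:
            # The loop finished without an early exit, so score >= threshold.
            hits.append(i)
    return hits


example_profile = [{"A": 2.0, "C": 0.0}, {"A": 0.0, "C": 3.0}]
example_hits = profile_matches(example_profile, "ACAAC", 5.0)
```

In this toy instance the alignment starting at position 1 is abandoned after a single letter, because a score of 0 at the first profile position leaves at most 3 from the remainder, which is below the threshold of 5. The worst-case cost of this plain version is still proportional to the product of the text and profile lengths; the pruning only helps on inputs where most alignments fail early.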
