Efficient Computation of Sequence Mappability

Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices \(j \ne i\) such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous work on this problem has focused on heuristic approaches that compute a rough approximation of the result, or on the case \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid counting pairs of substrings more than once. We also show \(\mathcal {O}(n^2)\)-time algorithms that compute all results for a fixed m and all \(k=0,\ldots ,m\), or for a fixed k and all \(m=k,\ldots ,n-1\). Finally, we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.
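
To make the problem statement concrete, the following is a minimal brute-force sketch of the (k, m)-mappability table as defined above. It is not one of the algorithms of the paper: it simply compares every pair of length-m substrings and runs in \(\mathcal {O}(n^2 m)\) time. The function name `mappability` and the use of 0-based positions are assumptions made for illustration.

```python
def mappability(text: str, m: int, k: int) -> list[int]:
    """Brute-force (k, m)-mappability table (illustrative sketch only).

    Entry i counts the positions j != i such that text[i:i+m] and
    text[j:j+m] differ in at most k positions. Positions are 0-based.
    """
    starts = len(text) - m + 1  # number of length-m substrings
    if m <= 0 or starts <= 0:
        return []

    counts = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            mismatches = 0
            for d in range(m):
                if text[i + d] != text[j + d]:
                    mismatches += 1
                    if mismatches > k:
                        break  # this pair already exceeds the budget
            if mismatches <= k:
                # The relation is symmetric, so credit both positions.
                counts[i] += 1
                counts[j] += 1
    return counts


if __name__ == "__main__":
    # Length-2 substrings of "abab" are "ab", "ba", "ab".
    print(mappability("abab", 2, 0))  # [1, 0, 1]
    print(mappability("abab", 2, 2))  # [2, 2, 2]
```

Each unordered pair of positions is examined exactly once and credited to both entries, which sidesteps the double-counting issue that the faster algorithms have to handle explicitly.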

[1]  Richard Cole,et al.  Dictionary matching and indexing with errors and don't cares , 2004, STOC '04.

[2]  Giovanni Manzini,et al.  Longest Common Prefix with Mismatches , 2015, SPIRE.

[3]  Costas S. Iliopoulos,et al.  Longest Common Prefixes with k-Errors and Applications , 2018, SPIRE.

[4]  Costas S. Iliopoulos,et al.  Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome , 2009, 2009 9th International Conference on Information Technology and Applications in Biomedicine.

[5]  David G. Knowles,et al.  Fast Computation and Applications of Genome Mappability , 2012, PloS one.

[6]  Hiroki Arimura,et al.  Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications , 2001, CPM.

[7]  Srinivas Aluru,et al.  Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis , 2018, RECOMB.

[8]  János Komlós,et al.  Storing a sparse table with O(1) worst case access time , 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[9]  Costas S. Iliopoulos,et al.  Longest Common Prefixes with k-Mismatches and Applications , 2018, SOFSEM.

[10]  Costas S. Iliopoulos,et al.  Faster algorithms for 1-mappability of a sequence , 2020, Theor. Comput. Sci..

[11]  Tatiana Starikovskaya.  Longest Common Substring with Approximately k Mismatches , 2016, CPM.

[12]  Russell Impagliazzo,et al.  Which Problems Have Strongly Exponential Complexity? , 2001, J. Comput. Syst. Sci..

[13]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[14]  Michael L. Fredman,et al.  Storing a sparse table with O(1) worst case access time , 1982, FOCS 1982.

[15]  Edward M. McCreight.  A Space-Economical Suffix Tree Construction Algorithm , 1976 .

[16]  M. Farach.  Optimal suffix tree construction with large alphabets , 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[17]  Nuno A. Fonseca,et al.  Tools for mapping high-throughput sequencing data , 2012, Bioinform..

[18]  Brendan D. McKay,et al.  An Algorithm for Generating Subsets of Fixed Size With a Strong Minimal Change Property , 1984, Inf. Process. Lett..

[19]  Srinivas Aluru,et al.  A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem , 2016, J. Comput. Biol..

[20]  Russell Impagliazzo,et al.  On the Complexity of k-SAT , 2001, J. Comput. Syst. Sci..

[21]  Wojciech Rytter,et al.  Linear-Time Algorithm for Long LCF with k Mismatches , 2018, CPM.

[22]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[23]  Peter Sanders,et al.  Linear work suffix array construction , 2006, JACM.