Paper Information - Efficient Computation of Sequence Mappability

Efficient Computation of Sequence Mappability

Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose i-th entry is the number of indices \(j \ne i\) such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous work on this problem focused on heuristic approaches that compute a rough approximation of the result, or on the case of \(k=1\). We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in \(\mathcal {O}(n \min \{m^k,\log ^{k+1} n\})\) time and \(\mathcal {O}(n)\) space for \(k=\mathcal {O}(1)\). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show \(\mathcal {O}(n^2)\)-time algorithms to compute all results for a fixed m and all \(k=0,\ldots ,m\) or a fixed k and all \(m=k,\ldots ,n-1\). Finally, we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for \(k,m = \varTheta (\log n)\) unless the Strong Exponential Time Hypothesis fails.
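To make the problem statement concrete, here is a minimal Python sketch of the definition. It is an illustration only, not one of the algorithms from the paper. `mappability_naive` applies the definition directly and takes roughly \(\mathcal {O}(n^2 m)\) time. `mappability_all_k` shows one simple way to obtain the tables for a fixed m and every \(k=0,\ldots ,m\) in \(\mathcal {O}(n^2)\) time by sliding a window along each diagonal; it uses \(\mathcal {O}(nm)\) extra space, and the paper's own quadratic-time algorithm may well be organised differently. Both function names and the 0-based indexing are assumptions made for this example.

```python
def mappability_naive(t: str, m: int, k: int) -> list[int]:
    """(k, m)-mappability table of t, straight from the definition.

    Positions are 0-based (an assumption of this sketch). Entry i counts
    the positions j != i whose length-m substring differs from the one at
    i in at most k places.
    """
    starts = len(t) - m + 1  # number of length-m substrings
    if m <= 0 or starts <= 0:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            mismatches = 0
            for a, b in zip(t[i:i + m], t[j:j + m]):
                if a != b:
                    mismatches += 1
                    if mismatches > k:
                        break  # already too far apart
            if mismatches <= k:
                # The relation is symmetric, so credit both positions.
                table[i] += 1
                table[j] += 1
    return table


def mappability_all_k(t: str, m: int) -> list[list[int]]:
    """Tables for a fixed m and every k = 0..m; result[k][i] is entry i."""
    starts = len(t) - m + 1
    if m <= 0 or starts <= 0:
        return []
    # exact[i][h] = number of j != i at Hamming distance exactly h from i.
    exact = [[0] * (m + 1) for _ in range(starts)]
    for d in range(1, starts):
        # Distance between the windows starting at 0 and at d.
        h = sum(t[p] != t[p + d] for p in range(m))
        i = 0
        while True:
            exact[i][h] += 1
            exact[i + d][h] += 1
            if i + d + 1 >= starts:
                break
            # Slide both windows one step: drop the leftmost character
            # pair and take in the next pair on the right.
            h -= t[i] != t[i + d]
            h += t[i + m] != t[i + d + m]
            i += 1
    # Prefix sums over h turn "exactly h" counts into "at most k" counts.
    tables = []
    running = [0] * starts
    for k in range(m + 1):
        running = [running[i] + exact[i][k] for i in range(starts)]
        tables.append(running)
    return tables


if __name__ == "__main__":
    print(mappability_naive("abab", 2, 0))  # [1, 0, 1]
    print(mappability_all_k("abab", 2))
```

For `"abab"` with m = 2, the substrings are `ab`, `ba`, `ab`: the two copies of `ab` match each other exactly, while `ba` is at distance 2 from both, so it is only counted once k reaches 2.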

[1] Richard Cole, et al. Dictionary matching and indexing with errors and don't cares, 2004, STOC '04.

[2] Giovanni Manzini, et al. Longest Common Prefix with Mismatches, 2015, SPIRE.

[3] Costas S. Iliopoulos, et al. Longest Common Prefixes with k-Errors and Applications, 2018, SPIRE.

[4] Costas S. Iliopoulos, et al. Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome, 2009, 2009 9th International Conference on Information Technology and Applications in Biomedicine.

[5] David G. Knowles, et al. Fast Computation and Applications of Genome Mappability, 2012, PloS one.

[6] Hiroki Arimura, et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, 2001, CPM.

[7] Srinivas Aluru, et al. Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis, 2018, RECOMB.

[8] János Komlós, et al. Storing a sparse table with O(1) worst case access time, 1982, 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982).

[9] Costas S. Iliopoulos, et al. Longest Common Prefixes with k-Mismatches and Applications, 2018, SOFSEM.

[10] Costas S. Iliopoulos, et al. Faster algorithms for 1-mappability of a sequence, 2020, Theor. Comput. Sci.

[11] Tatiana Starikovskaya. Longest Common Substring with Approximately k Mismatches, 2016, CPM.

[12] Russell Impagliazzo, et al. Which Problems Have Strongly Exponential Complexity?, 2001, J. Comput. Syst. Sci.

[13] Edward M. McCreight, et al. A Space-Economical Suffix Tree Construction Algorithm, 1976, JACM.

[14] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time, 1982, FOCS 1982.

[15] Edward M. McCreight. A Space-Economical Suffix Tree Construction Algorithm, 1976.

[16] M. Farach. Optimal suffix tree construction with large alphabets, 1997, Proceedings 38th Annual Symposium on Foundations of Computer Science.

[17] Nuno A. Fonseca, et al. Tools for mapping high-throughput sequencing data, 2012, Bioinform.

[18] Brendan D. McKay, et al. An Algorithm for Generating Subsets of Fixed Size With a Strong Minimal Change Property, 1984, Inf. Process. Lett.

[19] Srinivas Aluru, et al. A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem, 2016, J. Comput. Biol.

[20] Russell Impagliazzo, et al. On the Complexity of k-SAT, 2001, J. Comput. Syst. Sci.

[21] Wojciech Rytter, et al. Linear-Time Algorithm for Long LCF with k Mismatches, 2018, CPM.

[22] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[23] Peter Sanders, et al. Linear work suffix array construction, 2006, JACM.