Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics

Covering alignment problems arise from recent developments in genomics: so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable the full content of diploid genomes to be used as the basis of sequence analysis. In this paper, we show that the computational complexity changes for natural extensions of alignments to pan-genome representations and to diploid genomes. More broadly, our approach can also be seen as a minimal extension of sequence alignment to labeled directed acyclic graphs (labeled DAGs). Namely, we show that finding a covering alignment of two labeled DAGs is NP-hard even on binary alphabets. A covering alignment asks for two paths $R_1$ (red) and $G_1$ (green) in DAG $D_1$ and two paths $R_2$ (red) and $G_2$ (green) in DAG $D_2$ that cover the nodes of the graphs and maximize the sum of the global alignment scores: $\mathsf{as}(\mathsf{sp}(R_1),\mathsf{sp}(R_2))+\mathsf{as}(\mathsf{sp}(G_1),\mathsf{sp}(G_2))$, where $\mathsf{sp}(P)$ is the concatenation of labels on the path $P$. Pairwise alignment of haplotype sequences forming a diploid chromosome can be converted to a two-path coverable labeled DAG, and then the covering alignment models the similarity of two diploids over arbitrary recombinations. We also give a reduction in the other direction, to show that such a recombination-oblivious diploid alignment is NP-hard on alphabets of size 3.
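
To make the objective concrete, the following is a minimal brute-force sketch of covering alignment on tiny labeled DAGs. It is not the method of the paper, and it rests on several assumptions that the abstract does not fix: each path runs from a source to a sink, the alignment score $\mathsf{as}$ uses a unit scheme (match +1, mismatch -1, gap -1), and a DAG is given as a pair of dictionaries (node labels, adjacency lists). All function names are hypothetical. The enumeration is exponential in the graph size, which is consistent with the NP-hardness result, so it is only usable on toy inputs.

```python
from itertools import product


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (assumed unit scoring scheme)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def source_to_sink_paths(labels, edges):
    """Enumerate every source-to-sink path of a DAG (assumption: paths are maximal)."""
    has_pred = {v for succ in edges.values() for v in succ}
    sources = [v for v in labels if v not in has_pred]
    paths = []

    def extend(path):
        succ = edges.get(path[-1], [])
        if not succ:  # reached a sink
            paths.append(tuple(path))
        for v in succ:
            extend(path + [v])

    for s in sources:
        extend([s])
    return paths


def spell(path, labels):
    """sp(P): concatenation of the node labels along the path P."""
    return "".join(labels[v] for v in path)


def covering_pairs(labels, edges):
    """All ordered (red, green) path pairs that together visit every node."""
    paths = source_to_sink_paths(labels, edges)
    nodes = set(labels)
    return [(r, g) for r, g in product(paths, repeat=2) if set(r) | set(g) == nodes]


def best_covering_alignment(dag1, dag2):
    """Maximize as(sp(R1), sp(R2)) + as(sp(G1), sp(G2)) by exhaustive search."""
    labels1, labels2 = dag1[0], dag2[0]
    best = None
    for (r1, g1), (r2, g2) in product(covering_pairs(*dag1), covering_pairs(*dag2)):
        score = (global_alignment_score(spell(r1, labels1), spell(r2, labels2))
                 + global_alignment_score(spell(g1, labels1), spell(g2, labels2)))
        if best is None or score > best:
            best = score
    return best


# Two diamond-shaped DAGs over the binary alphabet {a, b}.
diamond = {0: [1, 2], 1: [3], 2: [3]}
d1 = ({0: "a", 1: "b", 2: "a", 3: "b"}, diamond)  # spells "abb" and "aab"
d2 = ({0: "a", 1: "a", 2: "b", 3: "b"}, diamond)  # spells "aab" and "abb"
print(best_covering_alignment(d1, d2))
```

In this toy instance each DAG has exactly two source-to-sink paths, and both are needed to cover all nodes, so the search only has to decide which path of $D_1$ is paired with which path of $D_2$.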
