Filling Scaffolds with Gene Repetitions: Maximizing the Number of Adjacencies

In genome sequencing, there is a trend toward leaving whole-genome sequences incomplete. Motivated by this, Muñoz et al. recently studied the (one-sided) problem of filling an incomplete multichromosomal genome (or scaffold) H with respect to a complete target genome C such that the resulting genomic (or double-cut-and-join, DCJ for short) distance between H′ and C is minimized, where H′ is the corresponding filled scaffold. Jiang et al. recently extended this result to both the breakpoint distance and the DCJ distance, and to the (two-sided) case in which C also has some missing genes, and solved all of these problems in polynomial time. However, when H and C contain duplicated genes, the corresponding breakpoint distance problem becomes NP-complete, and no efficient approximation or FPT algorithm is known for it. In this paper, we mainly consider the one-sided problem of filling scaffolds with gene repetitions so as to maximize the number of adjacencies between the two resulting sequences; namely, given an incomplete genome I and a complete genome G, both with gene repetitions, fill in the missing genes to obtain I′ such that the number of adjacencies between I′ and G is maximized. We prove that this problem is also NP-complete and present an efficient 1.33-approximation for it. The hardness result also holds for the two-sided problem, for which a trivial factor-2 approximation exists. We also present FPT algorithms for some special cases of this problem.
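
To make the one-sided objective concrete, the following is a minimal Python sketch, not the approximation or FPT algorithms of the paper. It assumes unsigned genomes given as single sequences, treats an adjacency as an unordered pair of consecutive genes, and counts common adjacencies as the size of the multiset intersection; these conventions and the function names `adjacencies`, `common_adjacencies`, and `best_filling` are illustrative assumptions. The exhaustive search takes exponential time and is only meant for tiny inputs.

```python
from collections import Counter


def adjacencies(seq):
    """Multiset of unordered pairs of consecutive genes in seq (assumed convention)."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(a, b):
    """Size of the multiset intersection of the two adjacency multisets."""
    return sum((adjacencies(a) & adjacencies(b)).values())


def best_filling(incomplete, complete):
    """Exhaustively insert the missing genes into `incomplete`.

    Returns (score, filled), where `filled` has the same gene multiset as
    `complete` and keeps `incomplete` as a subsequence.
    Exponential time: a brute-force baseline for tiny examples only.
    """
    # Genes (with multiplicity) present in G but absent from I.
    missing = list((Counter(complete) - Counter(incomplete)).elements())
    candidates = {tuple(incomplete)}
    for gene in missing:
        # Try every insertion position; the set removes duplicate sequences.
        candidates = {
            s[:i] + (gene,) + s[i:]
            for s in candidates
            for i in range(len(s) + 1)
        }
    return max((common_adjacencies(s, complete), s) for s in candidates)


# Hypothetical toy instance: G = abcabd, I = abad (missing genes: b and c).
G = list("abcabd")
I = list("abad")
score, filled = best_filling(I, G)
print(score, "".join(filled))
```

In this toy instance I is a subsequence of G, so the search can rebuild G itself and reach all five of its adjacencies. The NP-completeness result stated above is why an exhaustive search like this does not scale beyond very small inputs.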
