A Linear Time Approximation Algorithm for the DCJ Distance for Genomes with Bounded Number of Duplicates

Rearrangements are large-scale mutations in genomes, responsible for complex changes and structural variations. Most rearrangements that modify the organization of a genome can be represented by the double cut and join (DCJ) operation. Given two genomes with the same content, that is, with exactly the same number of copies of each gene, we consider the problem of computing the rearrangement distance between them: the minimum number of DCJ operations needed to transform one genome into the other. We propose a linear-time approximation algorithm with approximation factor O(k) for the DCJ distance problem, where k is the maximum number of duplicates of any gene in the input genomes. As an intermediate step, the algorithm uses an O(k)-approximation for the minimum common string partition problem, which is closely related to the DCJ distance problem. Experiments on simulated data sets show that the algorithm is very competitive in both efficiency and solution quality.
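
To make the minimum common string partition problem concrete, the sketch below cuts two strings with the same character counts into a common multiset of blocks. It is not the O(k)-approximation described above. It is a simple greedy heuristic (repeatedly take the longest common substring of still-unused positions), it runs in polynomial rather than linear time, and it treats genes as unsigned characters. The function name `greedy_common_partition` is hypothetical and chosen only for this illustration.

```python
from collections import Counter


def greedy_common_partition(a: str, b: str) -> list[str]:
    """Greedy heuristic for a common string partition of a and b.

    Assumption for illustration: genes are single unsigned characters.
    Returns the blocks in the order the heuristic selects them.
    """
    if Counter(a) != Counter(b):
        raise ValueError("strings must contain the same multiset of characters")

    free_a = [True] * len(a)  # positions of a not yet covered by a block
    free_b = [True] * len(b)  # positions of b not yet covered by a block
    blocks = []
    remaining = len(a)

    while remaining > 0:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                length = 0
                # Extend the match while both positions are free and equal.
                while (i + length < len(a) and j + length < len(b)
                       and free_a[i + length] and free_b[j + length]
                       and a[i + length] == b[j + length]):
                    length += 1
                if length > best[0]:
                    best = (length, i, j)

        length, i, j = best
        blocks.append(a[i:i + length])
        for t in range(length):
            free_a[i + t] = False
            free_b[j + t] = False
        remaining -= length

    return blocks


if __name__ == "__main__":
    # "abcab" and "ababc" share the blocks "abc" and "ab".
    print(greedy_common_partition("abcab", "ababc"))
```

The number of blocks returned is an upper bound on the size of a minimum common string partition. Because the same substring is removed from both strings in every round, the unused characters always have equal counts, so each round finds a block of length at least one and the loop terminates.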
