A General Framework for Computing Rearrangement Distances between Genomes with Duplicates

Computing genomic distances between whole genomes is a fundamental problem in comparative genomics. Recent research has produced several genomic distance definitions: number of breakpoints, number of common intervals, number of conserved intervals, Maximum Adjacency Disruption number (MAD), etc. Unfortunately, in the presence of duplications, most of the corresponding problems are NP-hard, and several heuristics have therefore been proposed recently. However, while it is relatively easy to compare heuristics with one another, very little is known so far about their absolute accuracy. There is therefore a great need for algorithmic approaches that compute exact solutions for these genomic distances. In this paper, we present a novel generic pseudo-Boolean approach for computing the exact genomic distance between two whole genomes in the presence of duplications, and put strong emphasis on common intervals under the maximum matching model. Of particular importance, we show three heuristics that provide very good results on a well-known public dataset of gamma-Proteobacteria.
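To make the "number of common intervals" measure concrete, the following minimal sketch counts common intervals between two genomes in the simplest setting. It is an illustration only, not the pseudo-Boolean approach of the paper. It assumes that both genomes are duplicate-free and contain the same genes (for instance, after a matching between duplicates has already been fixed), that gene signs are ignored, and that single genes count as trivial common intervals. The function name `count_common_intervals` is hypothetical, and the quadratic enumeration is chosen for readability rather than speed.

```python
def count_common_intervals(g1, g2):
    """Count gene sets that are contiguous in both g1 and g2.

    Assumptions (not from the paper): genomes are duplicate-free,
    share the same gene content, and signs are ignored.
    Singletons are counted as (trivial) common intervals.
    """
    if len(set(g1)) != len(g1) or sorted(g1) != sorted(g2):
        raise ValueError("genomes must be duplicate-free with identical gene content")

    # Rewrite g1 as a permutation of positions in g2.
    position_in_g2 = {gene: idx for idx, gene in enumerate(g2)}
    perm = [position_in_g2[gene] for gene in g1]

    count = 0
    n = len(perm)
    for i in range(n):
        lo = hi = perm[i]
        for j in range(i, n):
            lo = min(lo, perm[j])
            hi = max(hi, perm[j])
            # perm[i..j] covers a contiguous range of g2 positions
            # exactly when its value span equals its length minus one.
            if hi - lo == j - i:
                count += 1
    return count
```

For two identical genomes of n genes, every one of the n(n+1)/2 segments is a common interval, so a larger count indicates more conserved gene order. With duplicates, the difficulty addressed in the abstract is choosing the matching between gene copies that optimizes such a count.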
