New Genome Similarity Measures Based on Conserved Gene Adjacencies

Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One then speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful (but also most complex) models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases the computational cost is the same as in the general family-free case, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that differ in expressive power. We give polynomial-time algorithms for two of them, and we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, which demonstrate the applicability and performance of our newly defined similarity measures.
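To make the notion of a conserved gene adjacency concrete, the following minimal sketch counts conserved adjacencies under the simplest model mentioned above, in which both genomes contain the same genes, each in exactly one copy. It is an illustration only and not one of the measures defined in this article: the function names, the encoding of genes as signed integers (the sign giving the orientation), and the restriction to a single linear chromosome are all assumptions made for the example.

```python
def adjacencies(genome):
    """Return the set of adjacencies of one linear chromosome.

    `genome` is a list of non-zero signed integers (hypothetical encoding:
    the absolute value is the gene, the sign is its orientation).
    """
    adj = set()
    for a, b in zip(genome, genome[1:]):
        # Reading the pair (a, b) on the opposite strand gives (-b, -a);
        # store the smaller of the two tuples as a canonical representative.
        adj.add(min((a, b), (-b, -a)))
    return adj


def conserved_adjacencies(g, h):
    """Count adjacencies present in both genomes (assumes single-copy genes)."""
    return len(adjacencies(g) & adjacencies(h))


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    h = [1, -3, -2, 4, 5]  # g with the segment (2, 3) inverted
    print(conserved_adjacencies(g, h))
```

In this example the inversion breaks the adjacencies at both ends of the inverted segment, while the adjacency inside the segment and the one between genes 4 and 5 are preserved. Handling duplicated genes or family-free input requires the richer models discussed in the abstract and is outside the scope of this sketch.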
