Paper information - Locating rearrangement events in a phylogeny based on highly fragmented assemblies

Background: The inference of genome rearrangement operations requires complete genome assemblies as input data, since a rearrangement can involve an arbitrarily large proportion of one or more chromosomes. Most genome sequencing projects, especially those on non-model organisms for which no physical map exists, produce very fragmented assemblies, so that a rearranged fragment may be impossible to identify because its two endpoints lie on different scaffolds. Breakpoints, however, are easily identified as long as they do not coincide with scaffold ends. In a phylogenetic context, comparing a fragmented assembly with a number of complete assemblies yields certain combinatorial constraints on breakpoints. We ask to what extent breakpoint data between a fragmented genome and a number of complete genomes can be used to recover all the rearrangements in a phylogeny.

Results: We simulate genomic evolution via chromosomal inversion, fragmenting one of the genomes into a large number of scaffolds to represent the incompleteness of the assembly. We identify all the breakpoints between this genome and the remaining genomes. We devise an algorithm that takes these breakpoints into account in trying to determine on which branch of the phylogeny each rearrangement event occurred. We present an analysis of the dependence of recovery rates on scaffold size and rearrangement rate, and show that the true tree, the one on which the rearrangement simulation was performed, tends to be the most parsimonious with respect to the number of events inferred.

Conclusions: It is somewhat surprising that the breakpoints identified just between the fragmented genome and each of the others suffice to recover most of the rearrangements produced by the simulations. This holds even in parts of the phylogeny disjoint from the lineage of the fragmented genome.
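The breakpoint idea described in the abstract can be illustrated with a minimal sketch. It assumes each genome is modelled as a signed gene order on a single linear chromosome; the function names (`invert`, `adjacencies`, `fragment`, `breakpoints`) are hypothetical and chosen for illustration. This is not the authors' algorithm for placing events on branches, only the underlying observation that a breakpoint is visible unless it falls at a scaffold end.

```python
def invert(genome, i, j):
    """Return a copy of a signed gene order with genome[i:j] reversed and sign-flipped."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def adjacencies(segment):
    """Return the adjacencies of a linear signed segment in a strand-independent form."""
    adj = set()
    for a, b in zip(segment, segment[1:]):
        # (a, b) read on one strand is the same adjacency as (-b, -a) on the other,
        # so store whichever of the two tuples sorts first.
        adj.add(min((a, b), (-b, -a)))
    return adj


def fragment(genome, cuts):
    """Split a gene order into scaffolds at the given cut positions (illustrative only)."""
    bounds = [0] + sorted(cuts) + [len(genome)]
    return [genome[s:e] for s, e in zip(bounds, bounds[1:])]


def breakpoints(scaffolds, reference):
    """Adjacencies observed inside the scaffolds that are absent from the complete reference."""
    ref_adj = adjacencies(reference)
    found = set()
    for scaffold in scaffolds:
        found |= adjacencies(scaffold) - ref_adj
    return found


reference = list(range(1, 9))        # complete genome: 1 2 3 4 5 6 7 8
derived = invert(reference, 2, 5)    # one inversion:   1 2 -5 -4 -3 6 7 8

# With a complete assembly, both endpoints of the inversion appear as breakpoints.
whole = breakpoints([derived], reference)

# Cutting the assembly exactly at one inversion endpoint hides that breakpoint.
cut = breakpoints(fragment(derived, [2]), reference)
```

Here `whole` contains two breakpoints, one per inversion endpoint, while `cut` contains only one, because the other endpoint coincides with a scaffold end and its adjacency is never observed.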

David Sankoff | Chunfang Zheng
