2× genomes - depth does matter

Background: Given the availability of full genome sequences, mapping gene gains, duplications, and losses during evolution should theoretically be straightforward. However, this endeavor suffers from overemphasis on detecting conserved genome features, which in turn has led to sequencing multiple eutherian genomes with low coverage rather than fewer genomes with high coverage and more even distribution in the phylogeny. Although limitations associated with analysis of low-coverage genomes are recognized, they have not been quantified.

Results: Here, using recently developed comparative genomic application systems, we evaluate the impact of low-coverage genomes on inferences pertaining to gene gains and losses when analyzing eukaryote genome evolution through gene duplication. We demonstrate that, when performing inference of genome content evolution, low-coverage genomes generate not only a massive number of false gene losses, but also striking artifacts in gene duplication inference, especially at the most recent common ancestor of low-coverage genomes. We show that the artifactual gains are caused by the low coverage of genome sequence per se rather than by the increased taxon sampling in a biased portion of the species tree.

Conclusions: We argue that it will remain difficult to differentiate artifacts from true changes in modes and tempo of genome evolution until there is better homogeneity in both taxon sampling and high-coverage sequencing. This is important for broadening the utility of full genome data to the community of evolutionary biologists, whose interests go well beyond widely conserved physiologies and developmental patterns as they seek to understand the generative mechanisms underlying biological diversity.
