Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences

In recent years it has become clear that gene loss is more prevalent than gene gain in eukaryotic genome evolution. However, the absence of a gene from an annotated genome is not always equivalent to the loss of that gene. Due to sequencing issues or incorrect gene prediction, genes can be falsely inferred as absent. This implies that gene loss is overestimated and, more generally, that falsely inferred absences affect comparative genomic studies. However, reliable estimates of how prevalent this issue is are lacking. Here we quantified the impact of gene prediction on gene loss estimates in eukaryotes by analysing 209 phylogenetically diverse eukaryotic organisms and comparing their predicted proteomes to their respective six-frame translated genomes. For Pfam domains predicted to have been present in the last eukaryotic common ancestor, we observe that 4.61% of domains per species were falsely inferred to be absent. This estimate varies substantially between phylogenetic categories of loss: for clade-specific loss (ancestral loss) we found 1.30%, and for species-specific loss 16.88%, to be falsely inferred as absent. For BUSCO 1-to-1 orthologous families, 18.30% were falsely inferred to be absent. Finally, we show that falsely inferred absences indeed affect loss estimates, with the number of losses decreasing by 11.78%. Our work adds to the increasing number of studies showing that gene loss is an important factor in eukaryotic genome evolution. However, while we demonstrate that on average inferring gene absences from predicted proteomes is reliable, caution is warranted when inferring species-specific absences.
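The comparison described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it only shows what a six-frame translation is and how a falsely inferred absence could be flagged once domain hits are available. The function names (`six_frame_translate`, `falsely_inferred_absences`) and the input sets are hypothetical, the standard genetic code is assumed for every species, and the domain search itself (for example against Pfam profiles) is not reproduced here.

```python
# Illustrative sketch only; names and inputs are assumptions, not the study's code.
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(_BASES, repeat=3)), _AMINO))

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def six_frame_translate(seq: str) -> list[str]:
    """Translate a DNA sequence in all six reading frames.

    Frames 0-2 are read from the forward strand and frames 3-5 from the
    reverse complement. Codons containing ambiguous bases become 'X'.
    """
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            protein = "".join(
                CODON_TABLE.get(strand[i:i + 3], "X")
                for i in range(offset, len(strand) - 2, 3)
            )
            frames.append(protein)
    return frames


def falsely_inferred_absences(expected: set[str],
                              proteome_hits: set[str],
                              genome_hits: set[str]) -> set[str]:
    """Domains missing from the predicted proteome but found in the genome.

    `expected` holds the domains assumed to be ancestrally present,
    `proteome_hits` those detected in the predicted proteome, and
    `genome_hits` those detected in the six-frame translated genome.
    """
    apparently_absent = expected - proteome_hits
    return apparently_absent & genome_hits
```

In this sketch a domain counts as falsely inferred absent when it is expected, not detected in the predicted proteome, yet detected in the translated genome; the remaining apparent absences are candidates for genuine loss or for absence from the assembly itself.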
