Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.
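The family-level comparison described above can be illustrated with a minimal sketch. It assumes that gene-family sizes have already been tabulated for a draft assembly and for a higher-quality assembly of the same genome. The function name `family_count_errors`, the family labels, and the counts are hypothetical and are not taken from the paper's data or pipeline.

```python
def family_count_errors(draft, reference):
    """Compare per-family gene counts between two annotations.

    draft, reference: dicts mapping a gene-family ID to the number of
    genes annotated in that family. A family absent from one annotation
    is treated as having zero genes there.
    """
    families = set(draft) | set(reference)
    overcounted = undercounted = correct = 0
    for fam in families:
        d = draft.get(fam, 0)
        r = reference.get(fam, 0)
        if d > r:
            overcounted += 1      # draft adds genes (e.g. fragmented genes)
        elif d < r:
            undercounted += 1     # draft loses genes (e.g. missing sequence)
        else:
            correct += 1
    total = len(families)
    wrong = overcounted + undercounted
    return {
        "overcounted": overcounted,
        "undercounted": undercounted,
        "correct": correct,
        "fraction_wrong": wrong / total if total else 0.0,
    }


# Toy, made-up counts for five hypothetical families.
reference_counts = {"famA": 3, "famB": 1, "famC": 2, "famD": 5, "famE": 1}
draft_counts = {"famA": 4, "famB": 1, "famC": 1, "famD": 5, "famE": 2}

result = family_count_errors(draft_counts, reference_counts)
print(result)
```

Tracking overcounted and undercounted families separately reflects the observation that incorrect assemblies both add and subtract genes, so a net gene total alone would hide part of the error.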
